Generalizing PageRank: Damping Functions for Link-Based Ranking Algorithms Ricardo Baeza-Yates ricardo@baeza.cl ABSTRACT This paper introduces a family of link-based ranking algorithms that propagate page importance through links. In these algorithms there is a damping function that decreases with distance, so a direct link implies more endorsement than a link through a long path. PageRank is the most widely known ranking function of this family. The main objective of this paper is to determine whether this family of ranking techniques has some interest per se, and how different choices for the damping function impact on rank quality and on convergence speed. Even though our results suggest that PageRank can be approximated with other simpler forms of rankings that may be computed more efficiently, our focus is of more speculative nature, in that it aims at separating the kernel of PageRank, that is, link-based importance propagation, from the way propagation decays over paths. We focus on three damping functions, having linear, exponential, and hyperbolic decay on the lengths of the paths. The exponential decay corresponds to PageRank, and the other functions are new. Our presentation includes algorithms, analysis, comparisons and experiments that study their behavior under different parameters in real Web graph data. Among other results, we show how to calculate a linear approximation that induces a page ordering that is almost identical to PageRank's using a fixed small number of iterations; comparisons were performed using Kendall's on large domain datasets. Categories and Subject Descriptors: H.4.m [Information Systems Applications]: Miscellaneous General Terms: Algorithms. Keywords: Link analysis, Link-based ranking, Web graphs. Paolo Boldi Carlos Castillo Universita di Roma "La Sapienza" ` Rome, Italy Yahoo! Research Universita degli Studi di Milano ` Barcelona, Spain & Santiago, Chile Milan, Italy boldi@dsi.unimi.it castillo@dis.uniroma1.it While for any given topic there might be thousands or even millions of pages available, the problem of ranking those pages to generate a short list is probably one of the key problems of Web IR, and this requires some kind of relevance estimation. One of the measures of importance of a scientific paper is the number of citations that the article receives. Following this idea, several authors proposed to use links for ranking web pages [28, 21, 25]; however, it quickly became clear that just counting the links was not a very reliable measure of authoritativeness (it was not in scientific citations either), because it is very easy to manipulate in the context of the web, where creating a page costs nearly nothing. The PageRank technique, introduced by Page et al. [31], actually tries to mend this problem by looking at the importance of a page in a recursive manner: "a page with high PageRank is a page referenced by many pages with high PageRank". The algorithm not only counts the direct links to a page, but also includes indirect links. The same is valid for scientific and bibliographic citations in general. PageRank is a simple, robust and reliable way to measure the importance of web pages, has a clear interpretation as a Markovian process, and can be computed in a very efficient way. For these reasons, most of today's commercial search engines are believed to use it as a part of their ranking function. In this paper we: · describe general ranking functions that depend on incoming paths of varying lengths, · show that PageRank belongs to this class of functions, · show how to compute these rankings, · compare the ranking orders induced by different ranking functions in real data, finding ways of approximating PageRank up to a very high precision. The rest of this paper is organized as follows: Section 2 introduces the notion of functional ranking, Section 3 describes three damping functions, Section 4 compares them analytically, and Section 5 experiments with different parameters for each function. Finally, the last section presents our conclusions. 1. INTRODUCTION While traditional Information Retrieval (IR) methods are used by web search engines to some extent, the web is much more massive, dynamic and less coherent than traditional text collections [2]. Partially supported by MIUR COFIN Project "Linguaggi formali e automi". Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGIR'06, August 6­11, 2006, Seattle, Washington, USA. Copyright 2006 ACM 1-59593-369-7/06/0008 ...$5.00. 2. FUNCTIONAL RANKINGS In this section, we introduce the notion of functional ranking, a general family of ranking functions that includes PageRank. To describe PageRank formally, we consider a web graph of N pages. Let AN ×N be the adjacency matrix in this graph, ai, j = 1 iff there is a link from page i to page j. This link matrix is seldom used as it is, mainly for two reasons: 308 Normalization. In the Web, creating an out-link is free, so there is an incentive for web page authors to create pages with many outlinks; this is the reason why a metaphor of "voting" is enforced [26] in which each page has only one "vote" that has to be split among its linked pages. This is typically done in link-based ranking by normalizing A row-wise: the normalization process means that every web page can only decide how to divide its own score among the pages it leads to, but it cannot assign more score than it has. Another way to look at normalization is that the matrix is turned into the transition matrix of a stochastic process. The normalization does not need to give each out-link the same value, as there is evidence that web links have different purposes such as navigating in a multi-page set, expanding the contents of the current page, pointing to another resource, etc. [17]. Also, links within the same site can be considered self-links and as such do not confer as much authority as a link between different sites; indeed, there are ranking methods like BHITS [6] that treat them differently. Other characteristics of links, such as the exploration level at which they appear in Web sites [27], or if they are at the beginning or the bottom of individual pages, or inside a certain HTML element, can also be used for non-uniform normalization [3]. To simplify our treatment, we will assume uniform normalization, so if a page has d out-links, each of those links has a weight of 1/d , but the results of this paper can be applied to other forms of normalization. Dangling nodes. Special attention should be paid to the possible presence of nodes with no outgoing arcs (known as "sinks" in graph theory): in fact, dangling nodes fail to produce a row-stochastic matrix, because the rows of dangling nodes are filled with zeroes. Dangling nodes can be dealt with by adding an extra node that is linked to and from all other nodes, or by introducing new arcs from each dangling node to every node in the graph [14]. In our analysis, we shall assume that all dangling nodes have been eliminated already in some way, so that we do not have to worry about their presence. All the algorithms we will present can be modified so that dangling nodes can be dealt with explicitly and with virtually no additional cost. Let P be the row-normalized link matrix of the graph with N nodes. PageRank r() is defined as the stationary distribution of the Markov chain with state transitions given by the matrix P + (1 - )1T v where [0, 1) is a parameter called damping factor (sometimes also called a dampening factor), and v is a fixed preference vector that may represent the interests of a particular user, or another ranking vector that is used for weighting pages. Note that the above matrix is ergodic (at least, if every entry of v is strictly positive), so it has exactly one stationary distribution. Even though most of our results can be easily restated with a non-uniform preference vector v, for the sake of clarity we shall only consider the uniform preference 1/N in the rest of the paper. As observed in [15, 8], the PageRank vector r() can be written as: 1 r() = (1 - ) t 1Pt , N t =0 or in matricial form: r() = (1 - ) 1 1(I - P)-1 N ||P|| < 1. graph p = x1 , x2 , . . . , xk , such that node xi is connected to node xi+1 , we define its branching contribution as follows branching( p) = 1 d1 d2 · · · dk-1 where d j is the outdegree, this is, the number of outgoing arcs, of node x j . Then, the ranking of node i according to PageRank is ri () = (1 - )| p| branching( p) N pPath(-,i) where Path(-, i) is the set of all paths into node i and | p| is the length of path p: this is because (Pt )i j contains the sum of the branching contributions of all paths of length t from i to j, as one can easily show by induction on t (a path of length 0 and branching 1 is also included in the summation). This way of expressing the PageRank of a node is interesting, because it highlights the fact that the rank of a node is essentially obtained as a weighted sum of contributions coming from every path entering into the node, with weights that decay exponentially in the length of the path. A natural generalization of this idea consists in taking into consideration a ranking R of the general form: R= or equivalently Ri = pPath(-,i) t =0 damping(t ) N 1 · Pt damping(| p|) 1 branching( p) N (1) 1 where the damping function is a suitable choice of weights. We call this form of ranking a functional ranking as it is parametrized by a damping function. This generalizes Lifantsev's [26] model in which the damping factor is a matrix of voting trust that is fixed during the computation, while in our case, this depends explicitly on the iterations. Our damping function could be even more 1 general by using D(t ), a damping matrix instead of damping(t ) N 1; in this paper we analyze only the latter form. Fogaras [15] proposed using decreasing link weights depending on path lengths in the reverse link graph, and used exponentially decreasing weights as in PageRank for finding good Web browsing "starting points" in the Web graph. Another, yet unexplored, possible direction would be to consider damping functions that depend on other properties of the paths (e.g., whether the path passes through some node out of a certain set) rather than on their length. As we have seen, generic PageRank is a functional ranking where the damping function damping(t ) = (1 - )t decays exponentially fast (something similar was first considered in citation analysis back in 1953! [22]). The next section shows several functional rankings by describing their damping functions. 3. DAMPING FUNCTIONS Formula (1) defines a form of ranking that is parametrized by a damping function; the latter describes how rapidly the importance of paths decays as the path length increases. A first, if only formal, problem is establishing which class of damping functions generates well-defined functional rankings. T H E O R E M 1. Every damping function such that the sum of dampings is 1 yields a well-defined normalized functional ranking. There is an equivalent, and actually very intriguing way of rewriting this formula, mentioned in [30] that leads to a conclusion similar to those of [10]: given a path, that is, a sequence of edges in the 309 Proof. As shown in [10, Corollary 2.4], for every pair of nodes i and j, and for every length t pPath(i, j),| p|=t 3.1 Exponential damping: PageRank As we already noted, PageRank can be seen as a functional ranking where the damping function decays exponentially: damping(t ) = (1 - )t . Since longer paths have less importance in the calculation of PageRank, it could be approximated by using only a few levels of links. In [11], it is shown that by using only the nodes at distance 1 from a target node (equivalent to linear damping with L = 2), PageRank values can be approximated with 30% of average error. Using nodes at distance 2, the average error drops to 20% and at distance 3, to 10%. After that, there are no significant improvements by adding a few more levels, and the cost (the number of nodes to be explored) is much higher. Computation. Since PageRank is the principal eigenvector of the modified graph matrix, it can be easily approximated by the iterative Power Method algorithm, as suggested by Page et al. in their original paper [31]; this iterative algorithm gives good approximations (both in norm and with respect to the induced node order) in few iterations, even though convergence speed and numerical stability decay when gets close to 1 [19, 18]. branching( p) 1. In other words, the sum of branching contributions of all paths of a certain length between two specific nodes does never exceed 1. A more general property holds (the proof is an easy induction on the path length): for every node i and every length t pPath(i,-),| p|=t branching( p) = 1. As a consequence, to guarantee that the functional ranking is welldefined and normalized (i.e., that rank values sum to 1) we need: i=1 pPath(-,i) N damping(| p|) 1 branching( p) = 1 N or equivalently t =0 damping(t ) N 1 pPath(-,-),| p|=t branching( p) = 1 . As pPath(-,-),| p|=t branching( p) = N the latter equality is equivalent to t =0 3.2 Linear damping As an (extreme) alternative to PageRank, let us consider a simple damping function such as: 2 (L-t ) L(L+1) t < L damping(t ) = 0 tL that is, a damping function that decreases linearly with distance, and reaches zero at distance L. The trivial case L = 1 gives a uniform ranking, and L = 2 is ranking by indegree, as in the latter case all paths of length 2 are not considered. From the definition, R = = = damping(t ) = 1. Of course, not all choices are equivalent, so we have to find out which functions generate better rankings. Since a direct link should be more valuable as a source of evidence than a distant link, we focus on damping functions that are decreasing on t , the length of the paths. Computation. For calculating functional rankings, we use the general algorithm shown in Figure 1; the next sections provide details on the initialization, stop condition and iteration steps for each calculation. Require: N: number of nodes, v: preference vector 1: for i : 1 . . . N do {Initialization} 2: S[i] R[i] START 3: end for 4: for k : 1 . . . do {Iteration step} 5: if STOP then 6: break 7: end if 8: Aux 0 9: for i : 1 . . . N do {Follow links in the graph} 10: for all j such that there is a link from i to j do 11: Aux[j] Aux[j] + R[i]/outdegree(i) 12: end for 13: end for 14: for i : 1 . . . N do {Add to ranking value} 15: R[i] Aux[i] × DAMP(k) 16: S [i ] S [ i ] + R [ i ] 17: end for 18: end for 19: return S Figure 1: Template algorithm for computing a functional damping. START, STOP and DAMP(k) differ for each functional ranking. t =0 damping(t )vPt = L(L + 1) vPt t =0 L 2(L - t ) 2 (L - t )Pt v L(L + 1) t =0 2 v(L(I - P) - P(I - PL ))((I - P)2 )-1 . L(L + 1) L-1 provided that (I - P)2 is not singular. An advantage of this type of ranking is that only the first few levels are considered, so the number of iterations is fixed. The rationale for this is that after a certain distance the information given by links can be disregarded. Computation. For computing this functional ranking, we can define the following sequence: R(0) R(k+1) = = 2 v L+1 (L - k - 1) k R P. (L - k) -1 The functional ranking with linear damping is L=0 R(k) . For k computing this ranking, the generic algorithm shown in Figure 1 can be used, with: START : 2v[i]/(L + 1) STOP : k = L DAMP(k) : (L - k)/(L - (k - 1)) 310 3.3 Quadratic hyperbolic damping: TotalRank Recently, a ranking method called TotalRank [7] has been proposed. The method aims at eliminating the necessity for an arbitrary parameter by integrating PageRank over the entire range of . If r() is the vector of PageRank, then TotalRank is defined as: T= T can be written as: Z1 0 It is easy to see that t =0 R(k) = s(), because R(k) = 1/(N · )) 1 · Pk ; this observation allows us to use the generic ()(k + 1) algorithm of Figure 1 with the following parameters: Z1 0 START : v[i]/() STOP : convergence DAMP(k) : (k/(k + 1)) r()d . r()d = = 1 N t =0 Z1 0 (1 - )t 1 · Pt d 1 1 1 · Pt , N t (t + 1)(t + 2) =0 Note that convergence speed is much slower than ordinary PageRank, especially when is close to 1, the norm of the k-th summand being bound by 1/(1 + 1/k) . Interestingly enough, though, convergence speed is reasonable if is sufficiently large. 3.5 An empirical damping An empirical damping function would consider how much the value of an endorsement decreases by following longer paths in the real web graph. This cannot be known exactly, but we can attempt to measure it indirectly. Pages that link to each other are more similar than pages chosen at random [13]; evidence from topical crawlers [34] shows that when doing breadth-first exploring, the topic "drifts" as the distance increases. On the same line of thought, we propose to use the decrease of text similarity as an approximation to an "empirical" damping function. In [29] it is shown that text similarity and link distance are anticorrelated up to 4-5 links. To find out which is the correlation between link-distance and similarity, we performed the following experiment: we considered a web graph corresponding to a partial snapshot of the .uk domain with 18 million pages, and sampled 200 nodes at random. For each sampled node, we followed links backwards to obtain nodes at a minimum distance of 1, 2, 3, 4, or 5 links. Then, we sampled 12,000 pairs at each minimum distance at random, and computed their similarities with the original nodes. Similarity was measured using the normalization of TF.IDF [4], without stemming or stopword removal. The resulting averages are shown in Figure 2, with standard deviation error bars. Text similarity clearly decreases with distance, and in some applications the empirical distribution of text similarity versus distance could be used as an "empirical" damping function. Different measures of text similarity can yield different distributions; for instance [36] uses the number of repeated words and phrases between pages and obtains a faster decrease in similarity. Our results show that a linear damping with L = 8 or L = 9 approximates better text similarity than an exponential damping as suggested in [29]; also, for different communities the link structure where the first equality is obtained applying Theorem 1.27 of [33]. Provided that P is not singular and P = I, we can write TotalRank using the definition of the logarithm of a matrix: ln(I - P) = - Pk k=1 k T = P-1 (I + (I - P-1 ) ln(I - P)) TotalRank is a weighted sum of the scores associated with paths of varying lengths, in which the weights are hyperbolically decreasing on the lengths of the paths. In other words, TotalRank is a functional ranking with damping function: 1 1 1 = - , (t + 1)(t + 2) t + 1 t + 2 which is well defined since t =0 damping(t ) = 1. Computation. It is known that the cost of calculating TotalRank is the same as the cost of calculating PageRank via the Power Method [8], even though some more iterations are required to obtain the same precision. damping(t ) = 3.4 General hyperbolic damping: HyperRank TotalRank is part of a more general family of weighting schemes for paths of different lengths that can be approximated using: s() = 1 1 (t + 1) 1 · Pt . N () t =0 Again, this way of ranking follows the general scheme, with damping function chosen as damping(t ) = 1 . ()(t + 1) 0.7 0.6 0.5 0.4 0.3 0.2 1 2 3 Link distance 4 5 Here, we are using Riemann's zeta function, () = t =1 t - for normalization, and we need > 1 for it to converge. Note that when = 2 we get weights similar to those of TotalRank, in which the t -th coefficient is 1/(t + 1)(t + 2) whereas here it is 1/(2)(t + 1)2 . A meaningful choice for should be done considering the distribution of paths of different lengths in a scale-free graph. A large in PageRank, or a small in HyperRank, means increasing the effect of longer paths in the score. Computation. Let us define a vector sequence R(t ) as follows: R(0) R(k+1) = = 1 N () k +1 R( k ) P . k+2 Figure 2: Link distance vs. average text similarity. A link distance of one means a direct link exists. Text similarity appears to decrease linearly in the first few levels of links. 311 Average text similarity could be different (e.g. academic vs. commercial Web subsets), so we should measure first which is the correlation of link distance to text similarity in the specific collection we want to rank. 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 1.5 2.0 2.5 3.0 4. COMPARING DAMPING FUNCTIONS A comparison of the damping functions described in the previous section is shown in Figure 3: of course, hyperbolic damping functions decay asymptotically more slowly than exponential damping, but notice that for short paths the latter may dominate the former in many cases. 0.25 0.20 0.15 0.10 0.05 0.00 1 2 3 4 5 6 7 8 9 10 Length of the path (t) PageRank =1/2 PageRank =1/3 PageRank =1/4 TotalRank, HyperRank =2 HyperRank =3 HyperRank =4 5 10 15 20 25 Length 1.0 that minimizes the difference of sum of weights 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0.0 1 Max path length=25 Max path length=20 Max path length=15 Max path length=10 Max path length=5 1.5 2 2.5 Exponent 3 3.5 4 Figure 3: Weights given by the different damping functions, for some values of and . In this section, we aim at analyzing how similar are these functional rankings, and how we could use one of the damping functions to approximate another with a suitable choice of parameters. Weight Figure 4: Best for minimizing the difference of the sum of weights between PageRank and HyperRank, for various parameter combinations. And if > L: L-1 t =0 4.1 Approximating HyperRank with PageRank Now we want to approximate the weights of: s() = using the weights of: r() = 1- t t P , N t =0 1 1 (t + 1) Pt N () t =0 ( 2(L - t ) 1 - )t - L(L + 1) + t =L (1 - )t We will assume that L, so the evaluation of the difference between the two rankings is done in an area in which both rankings have non-zero values. The L that minimizes the difference for a given combination of and is L (, ) a = = + (2 + 1) +1 + 1 + ( +1)/2 and we proceed again by considering paths up to a certain length: 1 T - (1 - )t . t =0 ()(t + 1) he minimum can be zero, and it is attained at: 1 1 . () t =0 (t + 1) ( 1 + +1 )2 + 4 ( + 2) 2(1 - +1 ) +1 +1+O nd we have plotted it for different values of and in Figure 5. 5. PARAMETERS FOR THE DAMPING FUNCTIONS For our experiments, we used several snapshots from the Web, including the .uk, .it and .eu.int domains. For comparison, we also considered a synthetic scale-free network produced according to the evolving model described by Kumar et al. [24] (a combination of preferential attachment and random links) with the parameters suggested by Pandurangan et al. [32]. As far as the latter is concerned, in the generated graph the exponents for the powerlaw in the center part of the distributions are -2.1 for in-degree and PageRank, and -2.7 for out-degree; we generated a 100,000-nodes graph without disconnected nodes. In this section, we study the behavior of the ranking functions for different values of their parameters. ( , ) = 1 - The that minimizes the difference of weights for different values of and of the maximum path lengths is shown in Figure 4. In the case of = 2, for instance, for path lengths up to 10 to 20, the best is between 0.75 and 0.85. 4.2 Approximating PageRank with LinearRank For approximating the damping function of PageRank with the damping function of LinearRank, we consider the summation of the differences up to a certain path length. If L: ( 2(L - t ) 1 - )t - L(L + 1) t =0 312 30 25 Frequency .it Web graph 40 mill nodes; avg. dist. 14.9 0.3 .uk Web graph 18 mill. nodes; avg. dist. 14.8 0.3 L 15 10 5 0.9 0.8 0.7 .6 0 0.5 0.4 0.1 Frequency 5 10 15 Distance 20 25 30 20 0.2 0.2 0.1 0.0 0.0 5 10 15 Distance 20 25 30 5 10 15 20 25 Length 40 L that minimizes the difference of sum of weights 35 30 25 20 15 10 5 0.5 0.6 0.7 Exponent 0.8 0.9 Max path length=25 Max path length=20 Max path length=15 Max path length=10 Max path length=5 .eu.int Web graph 800,000 nodes; avg. dist. 10.0 0.3 Synthetic graph 100,000 nodes; avg. dist. 4.2 0.3 Frequency 0.1 Frequency 5 10 15 Distance 20 25 30 0.2 0.2 0.1 0.0 0.0 5 10 15 Distance 20 25 30 Figure 6: Distribution of the average number of nodes at a certain distance from a given node in three Web samples and a synthetic scale-free network. Figure 5: Best L for minimizing the difference of the sum of weights between LinearRank and PageRank, for various parameter combinations. 5.1 Characteristic path lengths In scale-free networks, the distances between pairs of nodes follow a Gaussian distribution [1] (the average is not given in their paper). Analytic estimations for the average distance of a graph of scale-free network of n nodes include: O(log(n)) [35]; O(log(n) / log(n p)) in sparse graphs with p links [12]; 1 + log(n/z1 )/ log(z2 /z1 ) where z1 is the average indegree, and z2 is the average number of nodes at distance 2 [30]; and O(log(n)/ log(log(n))) [9]. We did the following experiment: starting from a node picked at random, we followed the links backwards and counted the number of nodes at different distances. The average distances found, appear to be growing (sub)logarithmically with the size of the graph. Figure 6 shows the distribution obtained in each sample (the synthetic graph has less variance due to its small size). For this experiment, we are not counting the pages without in-links. The act of linking a page represents human endorsement and should not be affected by the size of the graph. Neither the act of following a link, in terms of a random surfer, should be affected. However, an algorithm for propagating this endorsement through links for computing a ranking function needs to account for the typical distances involved; this need is typical in a situation where local properties have a global impact: for example, the addition of a single arc may reduce drastically the diameter of a graph. In most cases, researchers have used exponential damping with base 0.85 or 0.90 in graphs that are much smaller than the full Web (concept graphs, social networks, e-mail graphs, etc.), meaning that a potentially much larger fraction of the nodes contributed towards link ranking. We consider that in a smaller graph, the damping function should decay faster. Let's suppose that for a graph with N1 nodes it is found, by experimental or analytic means, that a good parameter for PageRank is . Now, we would like to have a good parameter for a graph 1 2 with the same properties, except that the size of the new graph is N2 < N1 . One possible approach, consistent with what we have done so far, is to consider that the sum of the weights up to the average path lengths of the graphs (L1 , L2 ) have to be similar for both rankings to behave in a similar way. If we take this approach: 1 - ( )L1 +1 1 2 = = 1 - ( )L2 +1 2 ( ) L2 +1 ( ) log(N2 ) 1 1 L1 +1 log(N1 ) An example that can be used in practice is the following: let's consider a web graph with N1 = 11.5 × 109 pages (the size of the full Web estimated by [16]), and another graph with only N2 = 50 × 106 pages (the size of the Web of a large country); the second graph is roughly 3 orders of magnitude smaller. If it is shown empirically that = 0.85 is a good value for the 1 PageRank parameter for the whole Web, then = 0.81 should 2 have a similar behavior in the 50-million page set, which is natural as the path lengths are shorter. If the subset of web pages were even smaller, for instance, N2 = 106 pages (the size of the web of a large organization), then = 0.76, and for smaller graphs of N2 = 105 2 nodes, = 0.72. We recommend using these values for graphs 2 that are not comparable in size to the full Web graph. 5.2 Experimental comparison In this section, we present experimental results about the similarity between the ranking orders induced by some of the functional rankings discussed in the previous sections. To perform the experiments, we used data from the U.K. Web graph. To compare ranking orders, we used Kendall's : a correlation measure related to the number of inversions in the rank order of one variable when the data is ordered according to the other variable ( = 1 means perfect agreement, = -1 means reverse ordering). We tested the correlation of in-degree with these damping functions. In general the correlation drops as more levels of links are considered (we omit the details here for lack of space, they will appear in the full version of this paper). Figure 7 (a) shows how 313 (a) PageRank vs. HyperRank 1 (b) PageRank vs. LinearRank 0.8 0.7 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.95 0.9 1.00 0.95 0.90 0.85 0.80 25 0.95 2 2.5 3 3.5 4 20 L 15 10 5 0.5 0.6 0.7 0.8 0.9 0.9 that maximizes Kendall's 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 1.5 2 Actual optimum Predicted optimum with length=5 L that maximizes Kendall's 25 Actual optimum Predicted optimum with length=5 20 15 10 5 2.5 3 3.5 4 0.5 0.6 0.7 Exponent 0.8 0.9 Exponent Figure 7: Comparison (using Kendall's ) between ranking orders in the U.K. web graph, with various damping parameters. There is a region of the parameter space in which the ranking orders are very close. The predicted combination of parameters that yields the minimum difference with = 5, is very close to the actual optimum in both cases. PageRank compares with HyperRank for various pairs of and . In the limit , 1 both rankings are equivalent, and they remain similar in a large region of the parameter space. We can see that the rankings obtained with HyperRank and PageRank can be almost equivalent (Kendall's 0.95). Moreover, the analysis shown in section 4.1 considering only paths of lengths less than 5, provides a very good approximation for the optimal combination of parameters. This means that in fact, the difference in the damping functions in the first few levels is crucial. The exponents required for giving a good approximation of PageRank are small when 0.7, limiting the practical applicability of HyperRank, as it does not converge more quickly than PageRank. As far as LinearRank and PageRank with = 0.8 . . . 0.9 are concerned, paths of roughly 10 to 20 links should be considered to obtain rankings that are almost equivalent, as shown in Figure 7 (b). The predicted optimum given in section 4.2 with = 5 (i.e., considering only the summation of the differences between both damping functions up to paths of length 5) is very close to what was obtained in practice. For = 0.8, calculating LinearRank with L = 10 (which means the same number of iterations) gives 0.98; for = 0.9, calculating LinearRank with L = 15 also gives 0.98. In both cases, the ranking order of PageRank is approximated by the ranking order of LinearRank with very high precision. Precision Finally, we focused our attention on the LinearRank ranking that uses linear damping, to see if LinearRank with a small number of iterations can provide a ranking that is competitive with PageRank. With this aim in mind, we used the WebTREC Gov2 collection (available from the University of Glasgow for research purposes, see http://ir.dcs.gla.ac.uk/test collections/). This collection consists of about 25 million documents obtained in 2004 from the .gov domain. We picked at random 50 tasks and manually created keyword queries for this evaluation, following the policy used in the standard ad hoc TREC tasks. We then used the Managing Gigabytes for Java (mg4j) framework to select from the collection 1000 pages matching each query according and re-ordered the query results according to the scores resulting from different link-based ranking strategies. On this graph, the PageRank calculation took 39 iterations to converge on the L1-norm of the difference between two iterations to less than 10-6 . We computed the standard precision and recall measures [4] and averaged them across all queries. Precision at result number N is the fraction of correct results in the first N results returned by the system; the "correct" results in our case are taken from the quality assessments included in this reference collection. This is shown in Figure 8. Of course using link ranking improves the precision over no ranking at all, and PageRank and LinearRank behave very similarly. For instance, if we compare the PageRank (that requires 39 iterations) with LinearRank at distance 5 (that requires 5 iterations only) we observe that the precision of the first element is 8% better for PageRank, of the first five elements is 17% better for PageRank, but for the first ten elements is 2% better for LinearRank. From that point over, both rankings are roughly equivalent. This means that LinearRank at distance 5 can provide a level of precision for information retrieval tasks that is quite similar to that of PageRank. This is applicable in contexts where link-based 314 0.35 0.30 Average Precision 0.25 0.20 0.15 0.10 0.05 0.00 5 PageRank (39 iterations) LinearRank 5 iter. LinearRank 10 iter. LinearRank 20 iter. No ranking 7. REFERENCES [1] R. Albert, H. Jeong, and A. L. Barabasi. Diameter of the World Wide Web. ´ Nature, 401:130­131, 1999. [2] A. Arasu, J. Cho, H. Garcia-Molina, A. Paepcke, and S. Raghavan. Searching the Web. ACM TOIT, 1(1):2­43, 2001. [3] R. Baeza-Yates and E. Davis. Web page ranking using link attributes. In Alt. track papers & posters, WWW Conf., pp. 328­329, New York, NY, USA, 2004. [4] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison Wesley, 1999. [5] L. Becchetti, C. Castillo, D. Donato, S. Leonardi and R. Baeza-Yates. Using rank propagation and probabilistic counting for link-based spam detection. Tech. report DELIS-TR-0341, Dynamically-Evolving Large-Scale Inf. Sys. 2006. [6] K. Bharat and M. R. Henzinger. Improved algorithms for topic distillation in a hyperlinked environment. In Proc. of ACM SIGIR, pp. 104­111, Melbourne, Australia, 1998. ACM Press, New York. [7] P. Boldi. TotalRank: ranking without damping. In Poster Proc. of WWW Conf., pp. 898­899, Chiba, Japan, 2005. ACM Press. [8] P. Boldi, M. Santini, and S. Vigna. PageRank as a function of the damping factor. In Proc. of WWW Conf., pp. 557­566, Chiba, Japan, 2005. ACM Press. [9] B. Bollobas and O. Riordan. The diameter of a scale-free random graph. ´ Combinatorica, 24(1):5­34, 2004. [10] M. Brinkmeier. PageRank revisited. ACM TOIT, 6(3):257­279, 2006. [11] Y.-Y. Chen, Q. Gan, and T. Suel. Local methods for estimating PageRank values. In Proc. of CIKM, pp. 381­389, New York, USA, 2004. ACM Press. [12] F. Chung and L. Lu. The diameter of random sparse graphs. Adv. Appl. Math., 26:257­279, 2001. [13] B. D. Davison. Topical locality in the Web. In Proc. of ACM SIGIR, pp. 272­279, Athens, Greece, 2000. ACM Press. [14] N. Eiron, K. S. Mccurley, and J. A. Tomlin. Ranking The web frontier. In Proc. of WWW Conf., pp. 309­318, New York, USA, 2004. ACM Press. [15] D. Fogaras. Where to start browsing the Web? In IICS, vol. 2877 of Springer LNCS, pp. 65­79, Leipzig, Germany, 2003. [16] A. Gulli and A. Signorini. The indexable Web is more than 11.5 billion pages. In Poster Proc. of WWW Conf., pp. 902­903, Chiba, Japan, 2005. ACM Press. [17] S. W. Haas and E. S. Grams. Page and link classifications: connecting diverse resources. In DL '98: Proc. of ACM Conf. on Digital libraries, pp. 99­107, New York, NY, USA, 1998. ACM Press. [18] T. Haveliwala and S. Kamvar. The condition number of the PageRank problem. Tech. Report 36, Stanford University, 2003. [19] T. Haveliwala and S. Kamvar. The second eigenvalue of the Google matrix. Tech. Report 20, Stanford University, 2003. [20] T. H. Haveliwala. Topic-sensitive PageRank. In Proc. of WWW Conf., pp. 517­526, Honolulu, Hawaii, USA, 2002. ACM Press. [21] W.-K. Joo and S. H. Myaeng. Improving retrieval effectiveness with hyperlink information. In Proc. of IRAL, Singapore, 1998. [22] L. Katz. A new status index derived from sociometric analysis. Psychometrika, 18:39­43, 1953. [23] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604­632, 1999. [24] R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins, and E. Upfal. Stochastic models for the Web graph. In Proc. of FOCS, pp. 57­65, Redondo Beach, USA, 2000. IEEE CS Press. [25] Y. Li. Toward a qualitative search engine. IEEE Internet Comp., July 1998. [26] M. Lifantsev. Voting model for ranking Web pages. In Proc. of the Int. Conf. on Internet Computing, pp. 143­148, Las Vegas, Nevada, USA, 2000. [27] T.-Y. Liu and W.-Y. Ma. Webpage importance analysis using conditional Markov random walk. In Proc. of IEEE/WIC/ACM WI Conf., Compiegne, France, 2005. ACM Press. [28] M. Marchiori. The quest for correct information of the Web: hyper search engines. In Proc. WWW Conf., Santa Clara, USA, 1997. [29] F. Menczer. Lexical and semantic clustering by Web links. Journal of ASIST, 55(14):1261­1269, 2004. [30] M. E. Newman, S. H. Strogatz, and D. J. Watts. Random graphs with arbitrary degree distributions and their applications. Phys. Rev. E Stat. Nonlin. Soft Matter Phys., 64(2), 2001. [31] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: bringing order to the Web. Tech. report, Stanford University, 1998. [32] G. Pandurangan, P. Raghavan, and E. Upfal. Using Pagerank to characterize Web structure. In Proc. of COCOON, vol. 2387 of LNCS, pp. 330­390, Singapore, 2002. Springer. [33] W. Rudin. Real and Complex Analysis. McGraw-Hill, May 1986. [34] P. Srinivasan, G. Pant, and F. Menczer. A general evaluation framework for topical crawlers. Information Retrieval, 8(3):417­447, 2005. [35] D. J. Watts and S. H. Strogatz. Collective dynamics of small-world networks. Nature, 393(6684):440­442, June 1998. [36] F. Wu, B. A. Huberman, L. A. Adamic, and J. R. Tyler. Information flow in social groups. Physica A: Statistical and Theoretical Physics, 337(1-2):327­335, June 2004. 10 N = cutoff Value 15 20 Figure 8: Evaluation of the precision of LinearRank and PageRank in the WebTREC Gov2 collection. ranking cannot be computed in advance, but a computation at query time is necessary. For instance, this occurs if we need to analyze links over a sub-graph that is generated at query time. 6. CONCLUSIONS In this paper we have defined a broad class of link-based ranking algorithms based on the contribution of damping factors along all the different paths reaching a page. We found that functional rankings using different damping functions can provide similar orderings, if the parameters are chosen carefully. LinearRank can be used for calculating a ranking that is as good as PageRank, but with a fixed, and smaller, number of iterations. Also, the parameters for the damping functions depend on the characteristic path lengths in the graph, which are known to grow sub-logarithmically on the size of the graph. More work is needed to find other damping functions that compute rankings similar to PageRank but are easier and faster to compute. We use a global ranking similarity, but another measure could be the ranking similarity in the top 20 results of real queries. In this setting our results can change, so future work will include this variation. Our results show that the exponential damping used by PageRank is not that special. Because of their high cost, link-based ranking methods that involve iterative calculations at query time are probably not used by large-scale search engines at this moment, but the functional ranking with linear damping we have presented can provide a good approximation with few iterations. Moreover, the approach we have presented could be also applied to multivalued ranking functions such as HITS [23] and topic-sensitive PageRank [20] to obtain, for instance, a method for approximating the hubs and authority scores using less iterations and a linear damping function. Our approach also helps to understand how easy or difficult it is to collude many pages to modify the ranking of a given page. Clearly there are many different factors: path lengths, damping function, branching degrees, and number of colluded pages. The graph structure of the collusion will affect those factors and we plan to analyze them. In particular, under the assumption that is easier to "spam" closer links, PageRank damping is more affected by collusion than the rest of the damping functions presented here. In [5] a truncated exponential damping, combined with other linkanalysis techniques, is used for spam detection. Acknowledgements: we would like to thank Daniel Fogaras for ´ a valuable discussion about TotalRank that motivated part of this research. The authors also thank the support from ICREA and the Catedra Telefonica at Universitat Pompeu Fabra. ´ ´ 315