WWW 2008 / Poster Paper April 21-25, 2008 · Beijing, China Measuring Extremal Dependencies in Web Graphs Yana Volkovich University of Twente P.O. Box 217, 7500 AE Enschede, The Netherlands Nelly Litvak Ber t Zwar t Georgia Tech. 765 Ferst Drive, NW Atlanta, Georgia 30332-0205 y.volkovich@ewi.utwente.nl n.litvak@ewi.utwente.nl ABSTRACT We analyze dep endencies in p ower law graph data (Web sample, Wikip edia sample and a preferential attachment graph) using statistical inference for multivariate regular variation. The well develop ed theory of regular variation is widely applied in extreme value theory, telecommunications and mathematical finance, and it provides a natural mathematical formalism for analyzing dep endencies b etween variables with p ower laws. However, most of the prop osed methods have never b een used in the Web graph data mining. The present work fills this gap. The new insights this yields are striking: the three ab ove-mentioned data sets are shown to have a totally different dep endence structure b etween different graph parameters, such as in-degree and PageRank. Categories and Sub ject Descriptors: E.1 [Data structures]: Graphs and networks; G.3 [Probability and Statistics]: Multivariate statistics General Terms: Algorithms, Exp erimentation, Measurement Keywords: Regular variation, PageRank, Web, Wikip edia, Preferential attachment University of Twente P.O. Box 217, 7500 AE Enschede, The Netherlands ber tzwar t@gatech.edu 2. DATA SETS We chose three data sets that represent different network structures. As the Web sample, we used the EU-2005 data set with 862.664 nodes and 19.235.140 links, that was collected by LAW [1]. We also p erformed exp eriments on the Wikip edia(En) graph, that contains 4.881.983 nodes and 42.062.836 links. Finally, we simulated a Growing Network by using preferential attachment rule for 90% of new links [2]. The graph consists of 10.000 nodes with constant out-degree d = 8. In Figure 1 we show the cumulative log10 0 10 In-degree Out-degree PageRank 0 10 In-degree Out-degree PageRank 0 In-degree PageRank Fraction of pages Fraction of pages 10 -2 Fraction of pages 6 10 -1 10 -2 10 -4 10 -2 10 -4 10 -6 10 -3 10 -6 10 0 10 2 10 4 10 6 10 -8 10 0 10 2 10 4 10 -4 10 10 0 10 2 10 4 In/out-degree, PageRank In/out-degree, PageRank In-degree, PageRank (a) Figure 1: (b) (c) Cumulative log-log plots for in/(out)-degree and PageRank: (a) EU-2005, (b) Wikip edia, (c) Growing Network 1. INTRODUCTION The question of measuring correlations in the Web data has led to many controversial results. Most notably, there is no agreement in the literature on the dep endence b etween in-degree and PageRank of a Web page [4, 5]. One of the main p oints that we make in this work is that the commonly used correlation coefficient is an uninformative dep endence measure in heavy-tailed data [3, 7] typical for the Web and Wikip edia graphs. A natural mathematical formalism for analyzing p ower laws is provided by the theory of regular variation. By definition, the distribution F has a regularly varying tail with index , if P(X > x) = x- L(x), x > 0, where L(x) is a slowly varying function, that is, for x > 0, L(tx)/L(t) 1 as t . Below we analyze the dep endence structure in the p ower law data by means of contemp orary statistical techniques sp ecially designed for multivariate regular variation [7]. The work is supp orted by NWO Meervoud grant no. 632.002.401 Copyright is held by the author/owner(s). WWW 2008, April 21­25, 2008, Beijing, China. ACM 978-1-60558-085-2/08/04. log plots for in-degrees, out-degrees and PageRank scores in all data sets. The PageRank scores in the network of n nodes are computed according to the classical definition [6]: P R(i) = c j i cj 1-c 1 P R (j ) + P R (j ) + , i = 1, . . . , n, dj n D n where P R(i) is the PageRank of page i, dj is the numb er of outgoing links of page j , the sum is taken over all pages j that link to page i, D is a set of dangling nodes, and c is the damping factor, which is equal 0.85 in our case. Throughout the pap er we use the scaled PageRank scores R(i) = nP R(i). The log-log plots for in-degree and PageRank in Figure 1 resemble the signature straight line indicating p ower laws. However, several techniques should b e combined in order to establish the presence of heavy tails and to evaluate the p ower law exp onent. Using QQ plots, Hil l and altHil l plots as well as Pickands plots [7] we confirm that the in-degree and PageRank follow p ower laws with similar exp onents for all three data sets. We also conclude that the out-degree can b e modeled reasonably well as a p ower law with exp onent around 2.5-3, see [8] for details. Although all plots in Figure 1 look alike, it does not imply that the three networks have identical structure. The main goal of the present work is to rigorously examine the dep endencies b etween the network parameters. 1113 WWW 2008 / Poster Paper April 21-25, 2008 · Beijing, China 1.15 In-degree vs. PageRank 1.1 1.05 1 0.95 0.9 0.85 0.8 0.75 0 1 2 3 4 5 6 7 0 0 0.5 1 1.5 1 GRNet EU-2005 Wikipedia 3. ANGULAR MEASURE Scaling ratio Supp ose we are interested in analyzing the dep endencies b etween two regular varying characteristics of a node, X and Y . Let Xj and Yj b e observations of X and Y for the corresp onding node j . Following [7], we start by using the rank y x transformation of (X, Y ), leading to {(rj , rj ), 1 j n}, y x where rj is the descending rank of Xj in (X1 , . . . , Xn ) and rj is the descending rank of Yj in (Y1 , . . . , Yn ). Next we choose k = 1, . . . , n and apply the p olar coordinate transform as follows k = k POLAR (Rj,k , j,k ), ,y (1) x rj rj . x 2 + y 2 , arctan (y /x) where POLAR(x, y ) = Now we need to consider the p oints {i,k : Ri,k > 1} and make a plot for cumulative distribution function of . In other words, we are interested in the angular measure, i.e. in the empirical distribution of for k largest values of R. Thus, unlike the correlation coefficient, the angular measure provides a subtle characterization of the dep endencies in the tails of X and Y, or, extremal dependencies. If such measure is concentrated around /4 then we observe a tendency toward complete dep endence, when a large value of X app ears simultaneously with a large value of Y . In the opp osite case, when such large values almost never app ear together, we have either large value of X or large value of Y , hence, should b e around 0 or /2. The middle case plots can b e seen as a tendency to dep endency or indep endency. In the case of bi-variate data, a suitable value of k can b e determined by making a Starica plot [7]. We consider radii R1,k , . . . , Rn,k from (1) and write R(i) for the ith largest value. To get Starica plot we graph R , . 1jn (j ) /R(k) , j R(j ) /k R(k) The idea is that for good k the ratio in the ordinate should b e roughly a constant and equal 1 for the values of the abscissa in the neighb orhood of 1. The plot looks different for the different parameters k and one can either find a suitable k by trial and error or use numerical algorithms. In general, if the plot is going steep up from y = 1 at x = 1 then the chosen k is too large. On the other hand, if the graph stabilizes around y = 1 for some x < 1 then it means that k is too small, and we miss some valuable tail data. In Figure 2(a) we show the Starica plot for in-degree and PageRank in the Web data sets for k = 100.000 as an example. We refer to [8] for more details and results. Fraction of pages 0.8 0.6 (a) (b) 0.4 0.2 Scaling constant Theta 1 EU-2005 Wikipedia 1 EU-2005 Wikipedia Fraction of pages 0.6 Fraction of pages 0.8 0.8 0.6 (c) 0.4 (d) 0.4 0.2 0.2 0 0 0.5 1 1.5 0 0 0.5 1 1.5 Theta Theta Figure 2: (a) Starica plot for in-degree and PageRank for EU-2005 data set; Cumulative functions for Angular Measures: (b) in-degree and PageRank; (c) in- and out-degrees; (d) out-degree and PageRank the PageRank p opularity measure can not b e replaced by in-degree without significant disturbance in the ranking. The picture is different in Figure 2(c). In the Web, the in- and out-degree tend to b e indep endent which justifies the distinction b etween hubs and authorities. In Wikip edia the degrees are dep endent but this dep endence is not absolute. Finally, the dep endence b etween out-degree and PageRank in the Web and Wikip edia resembles the patterns observed for in-degree and PageRank. These results are useful in several ways. First, they reveal some imp ortant structural features thus extending our knowledge on real-life networks. Second, by comparing the dep endencies for exp erimental and synthetic data we can considerably improve existing graph models. 5. REFERENCES [1] http://law.dsi.unimi.it/. Accessed in January 2007. [2] R. Alb ert and A. L. Barabīsi. Emergence of scaling in a random networks. Science, 286:509­512, 1999. [3] D. Chakrabarti and C. Faloutsos. Graph mining: Laws, generators, and algorithms. ACM Comput. Surv., 38(1):2, 2006. [4] D. Donato, L. Laura, S. Leonardi, and S. Millozi. Large scale prop erties of the webgraph. Eur. Phys. J., 38:239­243, 2004. [5] S. Fortunato, M. Boguna, A. Flammini, and F.Menczer. How to make the top ten: Approximating PageRank from in-degree. In Proceeding of WAW2006, 2006. [6] A. N. Langville and C. D. Meyer. Deep er inside PageRank. Internet Math., 1:335­380, 2003. [7] S. I. Resnick. Heavy-tail Phenomena. Springer, New York, 2007. [8] Y.Volkovich, N. Litvak, and B. Zwart. Measuring extremal dep endencies in web graphs. Memorandum 1858, 2007. http://eprints.eemcs.utwente.nl/11349/. 4. DEPENDENCE MEASUREMENTS We computed the pairwise angular measure for the suitable k's determined in [8]. In Figure 2(b-d) we depict [0, /2] against the fraction of observations where the angle is greater or equal to . The results are striking. For the Wikip edia data set we observe the indep endence of the tails of in-degree and PageRank. That is, an extremely high in-degree almost never implies an extremely high ranking. The picture is completely opp osite for Growing Networks, where the angular measure is indicating a complete dep endence. Thus, in highly centralized preferential attachment graphs, most connected nodes are also most highly ranked. Finally, the Web graph exhibits a subtle dep endence structure with an angular measure close to uniform on [0, /2]. This suggests that 1114