WWW 2007 / Poster Paper Topic: Social Networks Finding Community Structure in Mega-scale Social Networks [Extended Abstract] Ken Wakita Tokyo Institute of Technology 2-12-1 Ookayama, Meguro-ku Tokyo 152-8552, Japan Toshiyuki Tsurumi Tokyo Institute of Technology 2-12-1 Ookayama, Meguro-ku Tokyo 152-8552, Japan wakita@is.titech.ac.jp ABSTRACT Community analysis algorithm proposed by Clauset, Newman, and Moore (CNM algorithm) finds community structure in social networks. Unfortunately, CNM algorithm does not scale well and its use is practically limited to networks whose sizes are up to 500,000 nodes. We show that this inefficiency is caused from merging communities in unbalanced manner and that a simple heuristics that attempts to merge community structures in a balanced manner can dramatically improve community structure analysis. The proposed techniques are tested using data sets obtained from existing social networking service that hosts 5.5 million users. We have tested three three variations of the heuristics. The fastest method processes a SNS friendship network with 1 million users in 5 minutes (70 times faster than CNM) and another friendship network with 4 million users in 35 minutes, respectively. Another one processes a network with 500,000 nodes in 50 minutes (7 times faster than CNM), finds community structures that has improved modularity, and scales to a network with 5.5 million. Further detail is reported in [3]. Categories and Subject Descriptors: H.2.8 [Database applications]: Data mining; G.2.2 [Graph Theory]: Graph algorithms General Terms: Algorithms, Performance Keywords: Community analysis, clustering, social networking service Time [sec] tsurumi2@is.titech.ac.jp work that consists of less than 500,000 users it was incapable to analyze larger networks. We observed that merging in coupling pair of community structures, balancedness of the pair has great impact on computational efficiency of CNM algorithm. Based on this observation, we have implemented a slightly modified versions of CNM algorithm and observed remarkable speedup and slight improvement of the modularity. 2. CNM ALGORITHM Newman and Girvan attempt to measure the quality of network clustering by means of modularity [2]. Their algorithm (CNM algorithm) is a bottom-up agglomerative clustering which continuously finds and merges pairs of clusters trying to maximize modularity of the community structure in a greedy manner [1]. The authors have programed CNM algorithm and attempted to analyze an friendship network of an SNS called "mixi1 " that hosted about one million users in October 2005. In spite of the good scalability as advertised in [1], we had to stop the experiment after a week when less than 10% of the whole analysis was finished. 1800 #Nodes = 500K #Nodes = 400K #Nodes = 300K #Nodes = 200K #Nodes = 100K 1600 1400 1200 1. INTRODUCTION 1000 Finding community structure in complex networks is an important first step to grasp inherent complex structure of social networks. Many research activities attempt to define the notion of communities and propose community analysis algorithms[2, 1]. We implemented a fast community analysis algorithm proposed by Clauset, Newman, and Moore [1] (CNM algorithm) and applied it to analyze various subsets of an acquaintance relationship network obtained from a social networking system (SNS). It performs well for a mid-scale subset of the netA long version of this paper is available as [3]. This work is supported in part by the Grant-in Aid (No. 18300041) of MEXT, Japan. Copyright is held by the author/owner(s). WWW 2007, May 8­12, 2007, Banff, Alberta, Canada. ACM 978-1-59593-654-7/07/0005. 800 600 400 200 0 0 50000 100000 150000 200000 250000 300000 350000 400000 450000 #Joins Figure 1: Analysis time. To figure out the performance bottleneck, we conducted community analysis on a various subsets of mixi SNS network. The mixi SNS assigns each user an incrementa serial registration ID . Let us represent the mixi SNS network by a mixi (http://mixi.jp/) is the largest invitation-based SNS in Japan and it hosts about 7 million users (Feb. 2007). 1 1275 WWW 2007 / Poster Paper graph Gmixi = (U, F ), where U = {1, 2, . . .} is the set of user IDs and F U × U is a set of friendship relationship. A subset of this graph can be given as follows: Gn ixi = (Un , F (Un × Un ))where Un = {u U |u n} m Figure 1 illustrates time required for analysis of various subsets: GNixi . Each bar of the graph depicts time required to m merge 10,000 community pairs. For most of the computation time is consumed for the first half of the merging process which decreases dramatically for the latter half. [1] discusses the computational complexity of CNM algorithm can be approximated by O(n log2 n), for n nodes but this approximation contracicts super quadratic cost illustrated in Figure 1. Topic: Social Networks Algorithm 1 Outline of the proposed algorithm. ratio (ci , cj ) min(s(ci )/s(cj ), s(cj )/s(ci )); while (true) { updateDeltaQ(); Find (ci , cj ) C 2 that has maximum QCi ,cj · ratio (ci , cj ). c if (max(QCi ,cj < 0) break; c C := join(ci , cj ); } Table 1: Elapsed time (minutes): Java 5.0, 3.2GB Heap, Intel Xeon 2.80GHz. G20i0K G40i0K G60i0K G80i0K m xi m xi m xi m xi CNM 42.2 197 NA NA HE 2.15 6.80 13.6 24.5 HE' 8.52 35.5 68.2 124 HN 0.43 1.16 2.05 3.17 G1M i mi x NA 36.2 173 4.47 G4M i mi x NA 243 3,400 33.3 by the number of its members, and HE measures a community by the number of edges that link this community with others, and HE' is a hybrid of CNM algorithm and HE. 4. EVALUATION Use of heuristics dramatically accelerates execution of community analysis (Table 1, Figure 3). The largest data set the original algorithm (Clauset+ (2004)) was possible to analyse is G50i0K . It took about 5.9 hours. The fastest heuristics (HN) m xi processes G1M i in less than five minutes. The slower on HE' mi x takes 3 hours but offers imporoved modularity [3]. Figure 2: Consolidation ratio of each merge step. From the merge logs that record how community pairs are merged into larger ones, it was observed that only a few large communities are growing fast, merging in many tiny communities. Figure 2 presents unbalancedness of merge steps in the course of community analysis for G50i0K . For this purm xi pose, we have defined the notion of consolidation ratio, which plots, for n-th merge step, c k := join(ci , cj ), (n, ratio (ci , cj )), where the size (s) of a community is measured in terms of the number of its links to other communities. ratio (ci , cj ) = min(s(ci )/s(cj ), s(cj )/s(ci )). The approximation of computational complexity (n log 2 n) of [1] is based on an anticipation that the merges are performed in a balanced manner, which in practice is not the case and thus leads to super quadratic computational complexity. 5. REFERENCES [1] A. Clauset, M. E. J. Newman, and C. Moore. Finding community structure in very large networks. Physical Review E, 70:066111, 2004. [2] M. E. J. Newman and M. Girvan. Finding and evaluating community structure in networks. Physical Review E, 69:026113, 2004. [3] Ken Wakita and Toshiyuki Tsurumi. Finding community structure in mega-scale social networks, February 2007, cs.CY/0702048. http://arxiv.org/abs/cs.CY/0702048v1. 400000 HE HE' HN 350000 300000 We have seen that if we could control the growth of communities so that they grow in a balanced manner, the performance of the algorithm will improve remarkably. The structure of the algorithm remains the same except for the way the commuity pairs are chosen: we use both QCi ,cj as well as c ratio (ci , cj ), which the original algorithm uses only the former. We can build variants of the algorithm choosing different metrics for computing the community sizes, s(c). We tested three variants HN, HE, and HE': HN measures a community Elapsed Time [sec] 3. HEURISTICS 250000 200000 150000 100000 50000 0 1e+06 1.5e+06 2e+06 2.5e+06 3e+06 3.5e+06 4e+06 4.5e+06 5e+06 5.5e+06 Size of Social Network Figure 3: Scalability 1276