WWW 2007 / Track: Semantic Web Session: Semantic Web and Web 2.0 Analysis of Topological Characteristics of Huge Online Social Networking Services Yong-Yeol Ahn Depar tment of Physics KAIST, Deajeon, Korea Seungyeop Han Division of Computer Science KAIST, Daejeon, Korea Haewoon Kwak Division of Computer Science KAIST, Daejeon, Korea yongyeol@gmail.com syhan@an.kaist.ac.kr haewoon@an.kaist.ac.kr Sue Moon Hawoong Jeong Division of Computer Science KAIST, Daejeon, Korea Depar tment of Physics KAIST, Deajeon, Korea sbmoon@cs.kaist.ac.kr ABSTRACT Social networking services are a fast-growing business in the Internet. However, it is unknown if online relationships and their growth patterns are the same as in real-life social networks. In this paper, we compare the structures of three online social networking services: Cyworld, MySpace, and orkut, each with more than 10 million users, respectively. We have access to complete data of Cyworld's ilchon (friend) relationships and analyze its degree distribution, clustering property, degree correlation, and evolution over time. We also use Cyworld data to evaluate the validity of snowball sampling method, which we use to crawl and obtain partial network topologies of MySpace and orkut. Cyworld, the oldest of the three, demonstrates a changing scaling behavior over time in degree distribution. The latest Cyworld data's degree distribution exhibits a multi-scaling behavior, while those of MySpace and orkut have simple scaling behaviors with different exponents. Very interestingly, each of the two exponents corresponds to the different segments in Cyworld's degree distribution. Certain online social networking services encourage online activities that cannot be easily copied in real life; we show that they deviate from close-knit online social networks which show a similar degree correlation pattern to real-life social networks. Categories and Sub ject Descriptors: J.4 [Computer Applications]: Social and behavioral sciences General Terms: Human factors, Measurement Keywords: Sampling, Social network hjeong@kaist.ac.kr million users 2 years ago, one fourth of the entire population of South Korea. MySpace and orkut, similar social networking services, have also more than 10 million users each. Recently, the number of MySpace users exceeded 80 million with a growing rate of over a hundred thousand people per day. It is reported that these SNSs "attract nearly half of all web users" [1]. The goal of these services is to help people establish an online presence and build social networks; and to eventually exploit the user base for commercial purposes. Thus the statistics and dynamics of these online social networks are of tremendous importance to social networking service providers and those interested in online commerce. The notion of a network structure in social relations dates back about half a century. Yet, the focus of most sociological studies has been interactions in small groups, not structures of large and extensive networks. Difficulty in obtaining large data sets was one reason behind the lack of structural study. However, as reported in [2] recently, missing data may distort the statistics severely and it is imperative to use large data sets in network structure analysis. It is only very recently that we have seen research results from large networks. Novel network structures from human societies and communication systems have been unveiled; just to name a few are the Internet and WWW [3] and the patents, Autonomous Systems (AS), and affiliation networks [4]. Even in the short history of the Internet, SNSs are a fairly new phenomenon and their network structures are not yet studied carefully. The social networks of SNSs are believed to reflect the real-life social relationships of people more accurately than any other online networks. Moreover, because of their size, they offer an unprecedented opportunity to study human social networks. In this paper, we pose and answer the following questions: What are the main characteristics of online social networks? Ever since the scale-free nature of the World-Wide Web network has been revealed, a large number of networks have been analyzed and found to have power-law scaling in degree distribution, large clustering coefficients, and a small mean degrees of separation (so called the small-world phenomenon). The networks we are interested in this work are huge and those of this magnitude have not yet been analyzed. How representive is a sample network? In most networks, the analysis of the entire network is infeasible and sampling 1. INTRODUCTION The Internet has been a vessel to expand our social networks in many ways. Social networking services (SNSs) are one successful example of such a role. SNSs provide an online private space for individuals and tools for interacting with other people in the Internet. SNSs help people find others of a common interest, establish a forum for discussion, exchange photos and personal news, and many more. Cyworld, the largest SNS in South Korea, had already 10 Copyright is held by the International World Wide Web Conference Committee (IW3C2). Distribution of these papers is limited to classroom use, and personal use by others. WWW 2007, May 8­12, 2007, Banff, Alberta, Canada. ACM 978-1-59593-654-7/07/0005. 835 WWW 2007 / Track: Semantic Web is unavoidable. We evaluate the popular snowball sampling method using the complete Cyworld network topology. How does a social network evolve? From the historical data of the Cyworld network, we reconstruct the past snapshots of Cyworld network. The evolution path of a social network may exhibit patterns of ma jor change. It would be of tremendous interest to latecomers, for it is literally a prediction of what they might turn into. This paper is organized as follows. In Section 2, we review related works. In Section 3, we describe social network topologies we have gained access to and, in Section 4, the snowball sampling method and metrics of interest in network analysis. Then we begin our topology analysis with that of Cyworld. In Section 5, we analyze the friend relationship network of Cyworld, perform a historical analysis on the network's evolution, evaluate the snowball sampling method, and compare a special subset of Cyworld network (the network of testimonials) with the complete network, and . In Section 6, we analyze MySpace and orkut networks. We compare all three networks in Section 7, along with a discussion on the origin of power-law in online social networks and the resemblance to real social networks. We conclude with a discussion on future work in Section 8. Session: Semantic Web and Web 2.0 3. ONLINE SNSS Social networking services (SNSs) provide users with an online presence that contains shareable personal information, such as a birthday, hobbies, preferences, photographs, writings, etc. Most SNSs offer features of convenience that help users form and maintain an online network with other users. One such feature is a "friend." A user invites another user to be one's friend. If the invited user accepts the invitation, a friend relationship is established between them. This friend feature is often used as a shortcut to others' front pages and, in some SNSs, a convenient tool to exchange messages and stay in touch. As SNSs have become very popular, leading sites boast of an extensive user base of tens of millions of subscribers. For this work, we have collected four sets of online social network topology data from three popular SNSs. All of them have more than 10 million users. We have obtained the entire network topology of Cyworld directly from its provider, and sampled others through web crawling. We believe our work is the first to analyze social networks of such magnitude. In Table 1, we summarize the four data sets described above. The metrics in the table are explained later in Section 4. Below we describe each network in detail. 2. RELATED WORKS The structural propertes, such as degree distribution, of large-scale social networks have received much attention since the uncovering of the network of movie actors [3]. It is followed by the analysis on the network of scientific collaboration network [5] and the web of human sexual contacts [6]. However, a link in these networks is different from a normal friend relationship and the large-scale analysis on the such networks has remained uncharted. Recently, the rapid growth of online social networking services made it possible to investigate the huge online social network directly. Since the rise of Cyworld, many SNSs including MySpace and orkut have grown. However, the analyses on these huge networks have been limited to cultural and business viewpoint [7]. Here, we introduce two relevant works on online social networks. First work is on an Internet dating community, called pussokram.com [8]. The dataset consists of about 30, 000 users and time series of all interactions. By network analysis, fat tails are found in all degree distributions from the networks made by several interaction layers: messages, guest book, and flirts. An interesting feature is super-heavy tails which go beyond the trend of small degree region. Another work investigates an online blog community, LiveJournal [9]. The number of users examined is 1, 312, 454, about half of whom publicize their snail mail addresses. By examining this partial list of real addresses of bloggers, the work uncovers the connection between online friendship and geography. The network's degree distribution also shows a weak but significant super-heavy tail, which deviates from the trend of small degree region. The huge size of online communities makes the sampling an inevitable process in analyzing the networks. Recently, extensive simulations are performed for several network sampling methods [10] and the effect of missing data in social network analysis is studied [2]. 3.1 Cyworld Cyworld is the largest and oldest online social networking service in South Korea. It began operation in September 2001, and its growth has been explosive ever since. Cyworld's 12 million registered users, as of November 2005, are an impressive number, considering the total population of 48 million in South Korea. As any SNS, Cyworld offers users to establish, maintain and dissolve a friend (called ilchon) relationship online. From SK Communications, Inc. the provider of the Cyworld service, we received an anonymized snapshot of the Cyworld user network taken in November 2005. The snapshot contains 191 million friend relationships among 12 million users. We have access to additional data to study the network evolution and describe it in Section 5.2. Cyworld offers a mechanism called ilchon pyung for friends to leave a testimonial on a user's front page1 Friends can leave one testimonial each, though modifiable more than once, and it usually describes one's impression or a word of encouragement. Not all friends write a testimonial and thus it can be construed as a manifestation of a close relationship in real life. We have decided to include testimonials in our analysis for comparison with the complete Cyworld network. We have used snowball sampling and collected testimonials from about 100,000 users. Note that the testimonial network is a directional graph, where a link represents one user's testimonial on another user's front page, and the complete Cyworld friend network is an undirected graph. 3.2 MySpace As of November 2006, MySpace is the largest social networking service in the world, with more than 130 million users. It began its service in July 2003, and the number of users grew explosively. According to Alexa.com2 , it is the world's 5th most popular website (4th among English websites). 1 Only 101 testimonials were allowed on front page originally, but the restriction was lifted. 2 http://www.alexa.com/ 836 WWW 2007 / Track: Semantic Web Set # sampling ratio p no. of nodes N no. of edges L mean degree k avg. clustering coefficient C assortativity r estimated degree of separation I Cyworld 100% 12, 048, 186 190, 589, 667 31.6 0.16 -0.13 3.2 II Testimonial(Cy) 0.77% 92, 257 703, 835 15.3 0.32 0.43 7.1 Session: Semantic Web and Web 2.0 III MySpace 0.08% 100, 000 6, 854, 231 137.1 0.26 0.02 2.7 IV orkut 0.30% 100, 000 1, 511, 117 30.2 0.31 0.31 3.8 Table 1: Summary of data sets from online so cial networking services A new user in MySpace by default gets a friend relationship with Tom Anderson, the cofounder of MySpace. In our dataset, we exclude links to him, since he has links to everyone. MySpace offers similar features with other social networking services, such as writing testimonials to friends on their front pages, checking upcoming birthdays, shortcuts to friends' front pages. We have obtained 100,000 user information from the MySpace friend network by crawling the MySpace online web site from September to October, 2006. The crawler randomly selects a starting user site, and crawl the user's friends' pages, their friends' pages, and so on. We have left out users who do not publicize their firends' list, and the amount of those users were about 23% out of all the nodes we have crawled. sampling, we randomly select a fraction of nodes. Links between selected nodes are included in the final sample network. Link sampling is similar. We randomly select a fraction of links and construct a sample network. In contrast, snowball sampling randomly selects one seed node and performs a breadth-first search (hence the name, snowball sampling), until the number of selected nodes reaches the desired sampling ratio. Only those links between selected nodes are included in the final sample network. The snowball sampling method is the only feasible one to crawl the web for the following reasons. First, node and link sampling methods are highly inefficient in practice, as accessing nodes or links randomly is simply not doable. Secondly, if the sampling fraction is not large enough, the sample network is likely to have many isolated clusters and be far from the original network in many aspects of interest. Furthermore, the expected mean degree of the sample network is always much smaller than of the original network. In short, we always underestimate the node degree, cannot estimate the degree of separation, nor find hubs (nodes with a very large number of neighbors), if we use node or link sampling. The main estimation error of the snowball sampling lies in the likelihood of oversampled hubs [10], for they have many links and are easily picked up in the first few rounds of the breadth-first search. In order to evaluate the deviating impact of snowball sampling on the metrics of interest, we take full advantage of the complete Cyworld network and compare various metrics between partial and complete networks. It is known that the power-law nature in the degree distribution is well conserved under snowball sampling [10] since the snowball sampling method easily picks up hubs. This property reduces the degree exponent and produces a heavier tail, but it is very difficult to get a power-law degree distribution from a network without the power-law decaying degree distribution. 3.3 orkut In September 2002, orkut began its trial service by a few Google employees, and became an official Google service in January 2004. Until recently, orkut accounts were given only to people invited by already existing users, which is different from Cyworld or MySpace. As it has permitted any user to create an account without invitation, it has expanded fast; the number of users reached 1 million at the end of July and surpassed 2 million by the end of September 20043 . Today, the number of orkut users exceeds 33 million. Once a user joins orkut, one can publish one's own profile, upload photos, and join communities of interest. Orkut also offers friend relationship. The maximum number of friends per user was limited to 1000, but this limit has also been lifted. Crawling in a similar way to MySpace, we have collected orkut friendship data on 100,000 users from June to September, 2006. 4. ANALYSIS METHODOLOGY In this section, we outline the sampling method employed to crawl and capture MySpace and orkut networks, and describe briefly the metrics of topological characteristics and their interpretations. 4.2 Metrics of interest 4.1 Snowball sampling We have gained access to the entire topology of Cyworld through human contact, but were not successful with MySpace or orkut. Falling back on crawling for data collection, we are limited in the number of nodes we could crawl in a finite time frame. There are several network sampling methods: node sampling, link sampling, and snowball sampling [10]. In node 3 http://en.wikipedia.org/wiki/orkut We begin the analysis of online social network topologies by looking at their degree distributions. Networks of a power-law degree distribution, P (k) k- , where k is the node degree and 3, attest to the existence of a relatively small number of nodes with a very large number of links. These networks also have distinguishing properties, such as vanishing epidemic threshold, ultra-small worldness, and robustness under random errors [11, 12, 13, 14]. The degree distribution is often plotted as a compRementary cumulative l probability function (CCDF), (k) k P (k )dk k- - ( - 1) k . As a power-law distribution shows up as a straight 837 WWW 2007 / Track: Semantic Web line in a log-log plot, the exponent of a power-law distribution is a representative characteristic, distinguishing one from others. Next, we examine the clustering coefficient. The clustering coefficient of a node is the ratio of the number of existing links over the number of possible links between its neighbors. Given a network G = (V , E ), a clustering coefficient, Ci , of node i V is: Ci = 2|{(v , w)|(i, v ), (i, w), (v , w) E }|/ki (ki - 1) (1) Session: Semantic Web and Web 2.0 Palmer et al. propose an approximation for the effective diameter of a massive graph [20]. The effective diameter is the 90th-percentile of the path length distribution, and is a better metric than the maximum diameter in estimating the network size, as the maximum diameter can be an outlier from a small number of nodes forming a chain. Palmer's approach is useful when the complete network topology is known, but too large to load on memory and calculate the exact neighborhood function. As our Cyworld topology can be loaded onto our server's 4 GB memory and complete network topologies of MySpace and orkut are not available, we do not use their approximation. For Cyworld data, we estimate the average path length and the effective diameter from sample networks and use (3) to calculate average path lengths for MySpace and orkut networks. where ki is the degree of node i. It can be interpreted as the probability that any two randomly chosen nodes that share a common neighbor have a link between them. For any node in a tightly-connected mesh network, the clustering coefficient is 1. The clustering coefficient of a node represents how well connected its neighbors are. The clustering coefficient of a network is the mean clustering coefficient of all nodes. Often it is insightful to examine not only the mean clustering coefficient, but its distribution. We denote the mean clustering coefficient of degree k as C (k) and analyze its distribution. Unless stated otherwise, the clustering coefficient in the rest of the paper refers to the mean clustering coefficient of a network. The degree correlation, knn , is a mapping between a node degree k and the mean degree of nearest neighbors of those nodes of degree k. Its distribution is often characterized by the assortativity (r), which is defined as the Pearson correlation coefficient of the degrees of either nodes which is connected by a link [15]. It is expressed as follows: r= q ki kj - ki kj 2 2 ( ki - ki 2)( kj - kj 2) 5. ANALYSIS OF CYWORLD , (2) We begin our network analysis with the Cyworld data set. As it is the most extensive data set we have, this section includes basic analysis of topology-related metrics, evaluation of the snowball sampling method, historical analysis of the network evolution, and online networks' similarity to reallife social networks. In Section 5.1, we first calculate the degree distribution, clustering coefficient distribution, and degree correlation distribution of the complete Cyworld network. Then we study how the Cyworld network has evolved over time in Section 5.2. In Sction 5.3, we evaluate the snowball sampling method by comparing sample networks to the complete Cyworld network. In Section 5.4, we analyze the network of testimonials. 5.1 Complete friend relationship network where ki and kj are degrees of the nodes located at either end of a link and the · notation represents the average over all links. If a network's assorativity is negative, a hub tends to be connected to non-hubs, and vice versa. When r > 0, we call the network to have an assortative mixing pattern, and when r < 0, disassortative mixing. Most social networks exhibit an assortative mixing pattern, whereas other networks show a disassortative mixing pattern [15, 16]. The assortative mixing pattern is considered as a unique characteristic of social networks and its origin was suggested as rich community structures of human relationships [17]. As introduced in Stanley Milgram's experiment of mail forwarding [18], the degree of separation () is the mean distance between any two nodes of the network. Accurate calculation of the degree of separation or the average path length, as we call it in this paper, requires the knowledge of the entire topology and the time complexity of O(N L), where L is the number of links and N is the number of nodes. In huge networks like Cyworld, MySpace, and orkut, the calculation is infeasible. Only approximation is possible. From a snowball sample network, we measure the number of nodes at each round of breadth-first search. By extrapolating this number sequence, we predict how many steps are needed to cover the entire network, and obtain an estimate of the average path length by the following formula [19]. log(N/n1 ) + 1, log(n2 /n1 ) (3) where N is the total number of nodes and n1 and n2 are the average numbers of first and second neighbors respectively. The first metric we investigate is the degree distribution. Figure 1(a) plots the complementary cumulative distribution function (CCDF) of the complete Cyworld friend relationship network, Set I. Most networks in nature and human societies have a power-law degree distribution with a single exponent, , between 2 and 3, attesting to the existence of hubs or people with a very large number of friends. However, the Cyworld network shows two different scaling regions as in Figure 1(a). The crossover takes place between k = 100 and k = 1000 and divides the CCDF into two regions: a rapid decaying ( 5) region and a heavy tailed ( 2) region. This behavior has not been reported previously about any SNS topologies. (Note that the exponent of a CCDF is smaller than the exponent of a probability distribution function itself by one.) The multi-scaling behavior observed in Cyworld suggests that Cyworld consists of two different types of networks, i.e., two types of users. Further analysis of distributions of the clustering coefficient and the degree correlation also support the existence of heterogeneous types of users. We revisit this issue in Section 5.3 against the analysis of sample networks and discuss further in comparison with other social networks in Section 7. Next, we examine the clustering coefficient. Figure 1(b) displays the entire distribution of the clustering coefficient, C (k), versus degree k. (Figures 1(b) and 1(c) also include graphs from the testimonial networks that we defer discussion on until Section 5.4.) The average clustering coefficient of the Cyworld network is 0.16, which is smaller than those of other network data sets to be discussed in later sections. It implies that Cyworld friend relationships are more loosely 838 WWW 2007 / Track: Semantic Web 100 10 -1 100 Friends network Testimonial network 104 Session: Semantic Web and Web 2.0 Friends network Testimonial network 10-2 10-3 (k) 10-4 10-5 10-6 10-7 10-8 100 -4 C(k) 10-1 -0.4 10-2 knn(k) 103 102 -1 10-3 101 (a) (b) 101 102 103 k 104 105 106 10-4 100 101 102 103 k 104 105 106 100 100 (c) 101 102 103 k 104 105 106 Figure 1: Top ological characteristics of the complete Cyworld network connected than other networks. However, a closer examination of the clustering coefficient distribution reveals that it actually has two different scaling regions, as in the degree distribution. Up until k = 500, the graph of the clustering coefficient distribution has a power-law distribution with an exponent of 0.4. Then beyond k = 500, the graph suddenly drops and disperses. (The dispersion is mainly due to small number of nodes in that region.) That is, neighbors of nodes with degree larger than 500 are very loosely connected and different from those of nodes with smaller degrees. We plot the distribution of degree correlation, knn , of the complete Cyworld network in Figure 1(c). It exhibits not a simple pattern, but a more complex and intractable one, just as the degree distribution of Figure 1(a) or the clustering coefficient distribution of Figure 1(b) do. If a network has an assortative mixing pattern, the degree correlation distribution exhibits an increasing scaling behavior; if disassorative, then a decreasing one. The overall trend of Figure 1(c) is decreasing, and the Cyworld network's assortativity is a negative value: -0.13. As human social networks are known to exhibit an assortative mixing pattern from hubs attracting hubs, lack of such a pattern and the complex, heterogeneous structure in the degree correlation distribution imply mixing of different types of users. Far-fetched, we could claim that the top left corner of Figure 1(c) shows negative correlation and the remaining upper cluster displays (although very slight) positive correlation, which is the main characteristic of social networks [17]. The degree correlation of large hubs seems to be neutral or slightly disassortative. We revisit this issue later with sample network analysis in Section 5.3. 0.6 First sample 100 samples 2000 samples 3000 samples the graphs densify [4]. They also report that the effective diameters of those graphs shrink over time. We estimate the average path length between nodes from sample networks. We randomly choose a seed node, run a breadth-first search of the network, and obtain a distribution of path lengths between the seed and all other nodes. We repeat this network sampling with 100, 2000, and 3000 seeds, and obtain Figure 2. The inset in the figure plots the cumulative distribution function. We can tell from the figures that the average path length between 90% of nodes is less than 6, even though the maximum distance may reach 18, more than double the value for 90%. In order to confirm that the distribution of the average path length eventually converge to that of the complete network, we have looked at the incremental change as the number of sample networks increased. We observe that the root mean square of the difference between the empirical probability density distributions of sample networks with h and (h + 1) seeds steadily decreases fitting to the y = 1/x line, confirming that our estimation would converge to the true average path length. 5.2 Historical analysis 50 45 40 Population (millions) 35 30 25 20 15 10 5 99/12 00/12 01/12 02/12 Time 03/12 04/12 05/12 Population Internet users Weekly Daily Cyworld Cyworld users w/o ilchon Snapshot I,II, Set I Dated partial set 0.5 0.4 Probability 1 0.3 CDF 0.5 Figure 3: Comparison of the national p opulation, Internet users, and Cyworld users As of March 2006, the population of Korea reached 48 million, and there were 24 million Internet users of ages over 15, the 6th largest in the world [21]. Internet users can be classified as "weekly" users, if they use the Internet at least once a week, or "daily" users, if they do every day. According to a market research company, Korean Click, the number of weekly users in 2000 was about 15 million and that of daily users was 10 million. When we extrapolate the numbers of daily and weekly users from 2000 to March 2006 along the increase in the overall Internet users, we obtain the graphs labeled as weekly and daily in Figure 3. 0.2 0.1 0 0 4 8 12 16 0 0 2 4 6 8 10 12 14 16 Distance from the Seed Figure 2: Probability distribution function of distances b etween no des Leskovec et al. reports that the average node degree of a wide range of real graphs increases over time and thus 839 WWW 2007 / Track: Semantic Web 100 10 -1 Session: Semantic Web and Web 2.0 100 10-1 10-2 10-3 -4 2005-11 2004-12 2003-12 2002-12 2005-11 2004-12 2002-12 104 2005-11 2004-12 2002-12 10-2 10-3 (k) 10-4 10-5 10-6 10 -7 103 knn(k) C(k) 102 10 101 (a) 101 102 103 k 104 105 106 10-5 10-6 100 (b) (c) 101 102 103 k 104 105 106 100 1 10 100 1000 k 10000 100000 1e+006 10-8 0 10 Figure 4: Evolution of the top ological characteristics of Cyworld. From Table 1, we know that the number of Cyworld users reached 12 million in November 2005. In addition to Set I, we have acquired several other data sets to investigate the historical growth of Cyworld. We have monthly statistics on the total numbers of Cyworld users from the very beginning of the Cyworld service in 1999, as well as the numbers of users with friend relationships. The former includes all users with or without any friend relationship. We have acquired two additional snapshots of existing friend relationships from April and September of 2005, which we call Snapshots I and II, respectively. Snapshots I and II, as well as Set I, include friend relationships. In other words, those users who have not established any friend relationship are not included. We also have acquired partial data (about 42%) on dates of establishments and dissolutions of friend relationships from SK Communications, Inc., which we call Dated Partial Set (DPS). Due to lack of knowledge about the biases in DPS, we assume that it is a random sample. We reconstruct partial friend relationship networks at every six month interval starting in December of 2002 from Dated Partial Set. Figure 3 plots all the historical data we have about Cyworld, the Internet users, and the Korean population. The curvy graph just below the extrapolated line of daily Internet users represents the all-inclusive Cyworld users and the graph below it represents only those with friend relationships. The latter overlaps with three data points obtained from Snapshots I and II and Set I. They do not coincide exactly, for data was collected on different dates and marginal differences exist between data sets. The total numbers of all-inclusive Cyworld users and those with friend relationships follow each other closely up until June 2004 and start to deviate. In June 2006, the gap between the two widens to over 4 million. The dramatic increase in the number of users without friend relationships could be contributed to several factors. One of the most likely factors is additional services offered by SK Communications, Inc. that require a Cyworld membership. People who want to use the service, but are not interested in SNS only join Cyworld and do not engage in building a social network. Whether a Cyworld user has a friend or not, the number of Cyworld users almost reaches the extrapolated number of daily Internet users in Korea in December 2005, about a third of the entire Korean population and more than 60% of Internet users. Considering the language barrier, it is unlikely that the user demography of Cyworld expands outside Korea much, as Brazilians represent a very active and significant portion in orkut or there are many Spanishspeaking users in MySpace. Thus we may conclude that Cyworld has reached saturation in terms of the number of users. The lowest graph in Figure 3 represents partial networks reconstructed from Dated Partial Set based on the date information of relationship establishment and dissolution. Partial friend relationship networks obtained from DPSs clearly do not sample the complete network equally every six months. However, we note that the sampling ratios in terms of the number of nodes are always higher than 5%. In order to study the historical growth of Cyworld in depth, we use 3 reconstructed partial networks from DPS, each in December of 2002, 2003, 2004, and Set1 from 2005. We refer to those DPS sets as DPS I, II, III, respectively. These networks are not snowball sampled, but are good representations of the complete networks, thanks due to the high sampling ratios of 5% or higher. Using these four sets, we study how the Cyworld network has evolved over time in terms of the degree distribution, the clustering coefficient distribution, and the degree correlation distribution. We first plot the CCDF of the degree distributions in Figure 4(a), and study the evolution of the degree distribution. The graph of DPS I is the shortest, reaching only up to k = 1000. It is close to a straight line, and we cannot see if it has a multi-scaling behavior. The graph of DPS II extends longer than that of DPS I and the tail part for k > 100 starts to exhibit a different slope. However, it is only with DPSs III and Set I that the multi-scaling behavior is clearly pronounced. From Figure 4, we can infer that user heterogeneity has started to materialize around December 2004. In Figure 4(b) we plot the distributions of clustering coefficients from December 2002, 2004, and November 2005. The distribution from December 2003 is very similar to that of 2004 and is omitted in order to avoid cluttering the plot. Up until December 2004, the distribution of clustering coefficient, C (k), is not power-law and looks similar from one year to another. From 2004 to 2005, the distribution of clustering coefficient changes significantly, and shows different regions of scaling behavior in November 2005. The degree correlation from December 2002, plotted in the lower left corner of Figure 4(c), exhibits an disassortative mixing pattern for k 50, but no clear pattern for k 50. The disassortative mixing pattern continues to manifest up to December 2004 for k 50. In December 2005, as we have observed already, the disassortative mixing pattern for k 100 is no longer clear, and the degree correlation for k 100 is spread out. The manifestation of different types of users is shown in all plots of metrics. As a last metric of historical analysis, we examine the evolution of the average path length and the effective diameter. 840 WWW 2007 / Track: Semantic Web 18 16 14 Path length 12 10 8 6 4 2 2002-12 Log(# of nodes) Effective Diameter Average Path Length Session: Semantic Web and Web 2.0 butions of sample networks have similar multi-scaling behaviors to that of the complete network: there are two regions of slow and rapid decaying. The exponents of the rapid decaying region range from 3.2 to 3.8, and those of the heavy tailed region from 1.2 to 1.8. As observed in [10], the exponents of sample networks are supposed to be smaller than that of the complete network. We can summarize sample networks have varying smaller degree exponents than the complete network, but clearly the same multi-region scaling behavior. Out of the above 10 sample networks, we choose two sample networks randomly and plot only the two in the rest of the analysis for convenience. We verified that including other sample networks did not change the qualitative conclusion in our analysis. Figure 6(b) depicts the distributions of clustering coefficient, C (k), of the two sample networks and the complete network. Recall that the complete network had a very clean scaling behavior up to k = 500 with an exponent of 0.4. The limited scaling behavior in the complete network vanishes; moreover, the two sample networks exhibit no scaling behavior in their clustering coefficient distributions. The average clustering coefficient of sample networks are larger than 0.16 of the complete network. The mean clustering coefficient of all 10 sample networks is 0.29. However, the converging behavior of the clustering coefficient (whether it increases or not) under the snowball sampling is not deterministic in general [10] and we cannot predict whether the original clustering coefficient of orkut and MySpace is larger than those from sample networks. We plot the degree correlation distributions of the two sample and complete networks of Cyworld in Figure 6(c). The sample networks exhibit a more definite disassortative mixing pattern in their degree correlation distribution. The distributions from the two sample networks exhibit a clear decreasing pattern for k < 100 and then disperse. In our preliminary work, we have evaluated how close topological characteristics of snowball sampled networks are to the complete network as we vary the sampling ratios [22]. From our numerical analysis, we suggest a practical guideline on the sampling ratio for accurate estimation of the topological metrics, excluding the clustering coefficient, where the explicit sampling ratio for accurate estimation is charted for the other metrics; 0.25% or larger for degree distribution, 0.2% or larger for degree correlation, and 0.9% or larger for assortativity. In the case of the clustering coefficient, even with a sampling ratio of 2%, it is inconclusive if the clustering coefficient of the sample network has converged close to that of the complete network. In summary, we observe the same multi-scaling behavior in degree distribution from sample networks, while the exponents of the distributions are different from that of the complete network. Other metrics, such as the clustering coefficient distribution, its mixing pattern, and the degree correlation, are not easy to estimate from samples. Though not rigorously validated, we demonstrate that only with a sampling ratio above a certain threshold, the scaling behavior of the node degree distribution is captured correctly in sample networks. 2003-06 2003-12 2004-06 Time 2004-12 2005-06 2005-11 Figure 5: Average path length vs time In Figure 5, we plot the log of the total number of nodes, the average path length, and the effective diameter as described in Sections 4.2 and 5.1. We use DPSs I to IV, as well as reconstructed networks from the original DPS for June of 2003, 2004, and 2005. Previously, a complex network is known to have the average path length scale logarithmically to the number of nodes, while a power-law network with 2 3 scales to log (log (N )). However, recent study by Leskovec et al. has reported that a wide range of graphs exhibit a densification trend and their diameters shrink over time [4]. In case of Cyworld, the average path length increases up to June 2004, and then starts to decrease, while the effective diameter vascillates until June 2004 and decreases afterwards. Due to space limitation, we only report that the number of nodes plotted against number of edges plotted in log-log scale exhibits a power-law distribution and thus the Cyworld network concurs to the same densification trend. We plot the effective diameter in Figure 5 and observe that it increases for the first year and a half of Cyworld service, and then starts to decrease. This junction coincides with the point in time when the number of users without friend relationships increases and the multi-scaling behavior starts to manifest in the degree distribution. We conjecture that the Cyworld has reached a saturation point as an active friend relationship network, and then graph densification has taken over, resulting in a decrease of both the average path length and the effective diameter. 5.3 Evaluation of the snowball sampling method As discussed in Section 4.1, snowball sampling is a better technique than node or link sampling methods, if its tendency to overrepresent hubs can be deliberated. In this section, we simulate a snowball sampling crawl on the complete Cyworld network, and evaluate the estimation error in snowball sampling. Here we randomly select 10 different seed nodes with k > 100 and get 10 sample networks starting from each seed. We set the sampling ratio to 0.33%, resulting in the final 40, 000 nodes in each sample network. Starting from seed nodes with smaller degrees does not make a difference, for within a few rounds of breadth-first search, a hub of k > 100 is reached, almost certainly as expected from the power-law degree distribution. Figure 6(a) plots the CCDF of the degree distributions of all 10 sample networks of Cyworld. For a reference, we include the cumulative degree distribution of the complete Cyworld network as an inset. The cumulative degree distri- 5.4 Testimonial network As described in Section 3.1, a testimonial represents a directional relationship of intimacy and interest in Cyworld. 841 WWW 2007 / Track: Semantic Web 100 Friends network Sample 1 Sample 2 104 Session: Semantic Web and Web 2.0 Friends network Sample 1 Sample 2 10-1 103 10-2 knn (k) 1 2 3 4 5 6 C(k) 102 10 -3 10 1 F 1 0-4 0 10 10 10 10 k 10 10 10 1 0 0 100 101 102 103 k 104 105 106 igure 6: Top ological characteristics of sampled network Not all online friends leave testimonials on their friends' front pages. We view testimonials as an online manifestation of close off-line relationship, and expect it to have close resemblance to real-life social networks. 100 10-1 10-2 (k) 10-3 10-4 10-5 10-6 in-degree out-degree exp(-x/18) knit relationships than the simple friends network. Accordingly, the testimonial network shows a positive degree correlation, which is a unique property of the social networks. 6. ANALYSIS OF ORKUT AND MYSPACE 0 50 100 k 150 200 250 Figure 7: Cumulative distribution of in-degree and out-degrees of Cyworld's testimonial network We have collected a sample network of 100,000 nodes for our testimonial network analysis. In contrast to the complete Cyworld network, both the in-degree and out-degree distributions of the sampled network show an exponential distribution. There is a sharp cut-off in in-degree distribution around k 100 which is due to the fact that the testimonial space is limited in the front page. However, the few data points near k 200 in the out-degree distribution indicates that some users leave extraordinarily many testimonials on others' front pages. To evaluate other metrics, we need an undirected network. We consider each link in this testimonial network as bidirectional in the following analysis. The clustering coefficient of Cyworld testimonial network is 0.32. As discussed before, the clustering coefficient from a sample network is hard to evaluate and we only report the value of 0.4. However the distribution of clustering coefficients as shown in Figure 1(b) has a clear scaling behavior, and the degree correlation in Figure 1(c) demonstrates an assortative mixing, a clearn sign of human social networking. We also estimate the average path length and it is about 7, larger than those of the friend relationship network shown in Figure 5. In summary, the testimonial network of Cyworld is a subset of the complete Cyworld friend network. However, its topological characteristics deviate from the complete network: it has exponential degree distributions, clear positive degree correlation and rather a large average path length. The testimonial network represents more tight and close- Figure 8(a) shows the degree distributions of orkut and MySpace. The degree distribution of orkut exhibits clear power-law with degree exponent 3.7. There is a cutoff around k 103 , which is due to orkut's old policy that the number of friends cannot exceed 1000. The degree distribution of MySpace also shows power-law with degree exponent 3.1, which is rather slow-decay in comparison with orkut. Users in MySpace have more friends than users in orkut. Thus, the degree distribution is broader. We should note that as in the sampling analysis of Cyworld, the sampling fraction of MySpace is not conclusive enough, thus, multi-scaling behavior might not be observed, even if it exists. Figure 8(b) shows the distributions of clustering coefficient. C (k) of orkut shows a tendency to decrease, but it is not evident. Our former analysis on the samples of Cyworld suggests that the decay in the real C (k) of orkut is more rapid than observed. The similar tendency is observed from the MySpace. The decreasing tendency of the degree correlation of orkut in the small and large degree region shown in Figure 8(c) may come from the effect shown in Cyworld sampling or the invitation system of orkut. Until recently, users could register to orkut only when he was invited by a user in orkut. So, new users were likely to be connected to heavy users. However, it now permits users to create accounts without an invitation. The assortativity of orkut friend network is 0.31. Orkut shares the property of the positive assortative mixing with the testimonial network of Cyworld, which is considered as a close-knit community. In contrast, the MySpace network's assortativity is almost neutral, r 0.02. Note that Figure 8(c) is depicted in log-log scale. In addition, the decay in degree distribution is rapid, even it is not exponential like the testimonial network of Cyworld. Since we only have data from a single seed, the estimation of degrees of separation for orkut and MySpace is much rougher than the case of Cyworld. In order to reduce the error, we let the number of first neighbors (n1 ) as the mean degree we measured, and we tune the number of second neighbors (n2 ) as to preserve the ratio of the number of first and second neighbors by assuming that the ratio (n2 /n1 ) is not so sensitive to the n1 . Note that the value n2 /n1 cannot fluctuate much due to the averaging effect, when the degree correlation is not strong. The calculated degrees of 842 WWW 2007 / Track: Semantic Web 100 10 -1 Session: Semantic Web and Web 2.0 100 orkut MySpace 105 orkut MySpace Cyworld orkut 10-2 10-3 10-4 10-5 10-6 10 -7 -2.7 10-1 104 10-2 Knn(k) (k) C(k) 103 -2.1 10-3 102 (a) 101 102 MySpace 103 k 104 105 106 10-4 0 10 (b) 101 102 k 103 104 105 101 100 (c) 101 102 k 103 104 105 10-8 100 Figure 8: Top ological characteristics of three so cial networks: Cyworld friends network, orkut, and MySpace. separation of orkut is 3.8, which is a little larger than that of Cyworld. The degrees of separation of MySpace is calculated in the same way with orkut and the value is 2.7, which is the smallest value among our data sets, while the size of myspace is the largest. which phenomenon stemming from early days as a popular site for the music community. The main webpage publishes "cool new people" and connecting to a new online friend is not a serious invitation-only affair. Cyworld has been a very private space in South Korea. By default, the list of friends is not made public, and neither are the photos and writings. On the other hand, Cyworld has also many features that encourage the creation of hubs. It picks two persons every day and puts their photos and link to their front pages at the main page. It is similar to the "cool new people" in MySpace. Moreover, Cyworld also has the club service that provides a homepage for various online communities. The organizer of a club tends to acquire many friends through the club activity. Previous work has hinted at a multi-scaling behavior for online social networks. Holme et al. analyze the multitudes of layers of an Internet dating community: messages, guest books, and flirts. In all layers, the degree distribution is power-law, with the tail slightly heavier than normal [8]. Liben-Nowell et al. shows the degree distribution of a network of more than one million bloggers and it also has a short, but heavy tail [9]. Neither network has a reference network for comparison and its scaling behavior was not explicable. Our analysis of Cyworld is a confirmation that those previous revelations are not an isolated incident, but stem from a firm underlying structure in online social networks. 7. DISCUSSION 7.1 Comparison of Cyworld, MySpace, and orkut We plot the three degree distributions of Cyworld, MySpace, and orkut together in Figure 8(a) for ease of comparison. In Figure 8(a), to compare the exponents more easily, the degree distribution graph of MySpace is rescaled and shifted to align with the heavy-tailed region of Cyworld while the range of both axes are identical to others. As seen previously, Cyworld has a multi-scaling behavior, while MySpace and orkut exhibit simple power-law. From Figure 8(a), we observe an interesting relation between the three degree distributions. The exponent of orkut matches that of the rapid decaying region of Cyworld, while the exponent of MySpace matches that of the heavy-tailed region of Cyworld. The exponent of orkut is = 3.7 and, for Cyworld, = 5. The two exponents are not exactly the same. However, as shown in Section 5.3, sample networks tend to have smaller exponents than the complete network and we circumspect that the real exponent of orkut be close to that of Cyworld's rapid decaying region. The exponent of MySpace is = 3.1 and that of Cyworld's heavy-tailed region, = 2. If we apply the same reasoning as in the case of orkut, the exponent of the complete MySpace network is probably larger than = 3.1. However, we expect it to be closer to that of Cyworld's heavy-tailed region than to that of orkut. Other quantitative relations give us more hints. The testimonial network's assortativity is the biggest, and that of Cyworld lies between that of orkut and MySpace. Same relation happens also in the degrees of separation. The testimonial network's degrees of separation is the largest, and that of Cyworld lies between that of orkut and MySpace. This peculiar correspondence is another corroborating evidence that Cyworld's heterogeneity comes from the mixing of two types of users: one, with an orkut-like structure and the other with a MySpace-like structure. Orkut is considered a relatively closed community, for a new user can register only by invitation. Relations in MySpace might be considered loose, as anyone can sign up without invitation and many are attracted to popularity, of 7.2 Origins of power-law behavior in online social networks The origin of power-law has been extensively studied and there are many mechanisms that produce power-law distributions. Which of those best explains the origin of the power-law in online social networks? The best known mechanism to generate power-law distributions is preferential attachment. Not only the well-known Barabīsi-Albert model, a but also many other mechanisms implicitly use preferential attachment. The transitive linking model [23], which is based on continuously completing triangles with only an edge missing, is one such example. Another noticeable viewpoint is fitness-based approaches. In any fitness-based approach, each node has its own fitness value and they are linked by the function of their fitness values [24, 25, 26]. In the case of the online social networks, both the preferential attachment and the fitness-based approach may contribute. More attractive and active persons are likely to have many online friends. Moreover, as one has more friends, it gets easier to have more friends through the transitive linking 843 WWW 2007 / Track: Semantic Web (a common friend of two persons introduces them to each other). A real-life social relation is harder to maintain than the online counterpart. When online, you do not move to a new place nor spend much time to make new friends. It is a lot easier to leave a message to online friends than to meet a friend in real life. And as told about Cyworld users' behavior, most online relations are not severed, even when not active. Dunbar claimed that the brain size in primates (including human beings) is correlated with the size of a community that they can manage. He estimated the mean community size of a human being to be about 147.8 [27]. Intriguingly enough, in Cyworld's friend relationship network, the crossover between two scaling regions takes place where 100 k 500, in the same order as Dunbar's law suggests. Session: Semantic Web and Web 2.0 [4] J. Leskovec, J. Kleinb erg, and C. Faloutsos. Graphs over time: densification laws, shrinking diameters and p ossible explanations. In SIGKDD, 2005. [5] M. E. J. Newman. Scientific collab oration networks. I. network construction and fundamental results. Phys. Rev. E, 64:016131, July 2001. [6] F. Liljeros, C. R. Edling, L. A. N. Amaral, H. E. Stanley, and Y. Ab erg. The web of human sexual contacts. Nature, 411:907, 2001. [7] A. Russo and J. Watkins. Digital cultural communication: Enabling new media and co creation in southeast asia. International Journal of Education and Development using Information and Communication Technology, 1(4), 2005. [8] P. Holme, C. R. Edling, and F. Liljeros. Structure and time-evolution of an internet dating community. Social Networks, 26:155, 2004. [9] D. Lib en-Nowell, J. Novak, R. Kumar, P. Raghavan, and A. Tomkins. Geographic Routing in so cial networks. Proceedings of the National Academy of Sciences, 102(33):11623­11628, Aug. 2005. [10] S. H. Lee, P.-J. Kim, and H. Jeong. Statistical prop erties of sampled networks. Phys. Rev. E, 73:016102, 2006. [11] R. Pastor-Satorras and A. Vespignani. Epidemic spreading in scale-free networks. Phys. Rev. Lett., 86:3200, 2001. [12] M. E. J. Newman. The spread of epidemic disease on networks. Phys. Rev. E, 66:016128, 2002. [13] R. Cohen and S. Havlin. Scale-free networks are ultrasmall. Phys. Rev. Lett., 90:058701, 2003. [14] R. Alb ert, H. Jeong, and A.-L. Barabasi. Error and attack tolerance of complex networks. Nature, 406:378, 2000. [15] M. E. J. Newman. Assortative mixing in networks. Phys. Rev. Lett., 89:208701, 2002. [16] M. E. J. Newman. Mixing patterns in networks. Phys. Rev. E, 67:026126, 2003. [17] M. E. J. Newman and J. Park. Why so cial networks are different from other typ es of networks. Phys. Rev. E, 68:036122, 2003. [18] S. Milgram. The small world problem. Psychology Today, 2:60­67, 1967. [19] M. E. J. Newman, S. H. Strogatz, and D. J. Watts. Random graphs with arbitrary degree distributions and their applications. Phys. Rev. E, 64:026118, 2001. [20] C. Palmer, P. Gibb ons, and C. Faloutsos. ANF: A fast and scalable to ol for data mining in massive graphs. In SIGKDD, 2002. [21] comscore. 694 million p eople currently use the internet worldwide according to comScore networks, 2006. [22] H. Kwak, S. Han, Y.-Y. Ahn, S. Mo on, and H. Jeong. Impact of snowball sampling ratios on network characteristics estimation: A case study of Cyworld. Technical Rep ort CS-TR-2006-262, KAIST, 2006. [23] J. Davidsen, H. Eb el, and S. Bornholdt. Emergence of a small world from lo cal interactions: Mo deling acquaintance networks. Physical Review Letters, 88:128701, 2002. [24] K. I. Goh, B. Kahng, and D. Kim. Universal b ehavior of load distribution in scale-free networks. Physical Review Letters, 87:278701, 2001. [25] G. Bianconi and A. L. Barabasi. Comp etition and multiscaling in evolving networks. Europhys. Lett., 54:436, 2001. [26] G. Caldarelli, A. Cap o cci, P. De Los Rios, and M. A. Munoz. Scale-free networks from varying vertex intrinsic fitness. Phys. Rev. Lett., 89(25), Decemb er 2002. [27] R. I. M. Dunbar. Co evolution of neo cortical size, group size and language in humans. Behavioral and Brain Sciences, 16(4):681­735, 1993. 7.3 Online networks like real social networks We conjecture that Cyworld's testimonial network is very similar to off-line social networks. First, strangers cannot write a testimonial , but only confirmed friends can. Even out of confirmed friends, not all are tempted or care enough to write a testimonial on a friend's front page for the rest of friends to view. Thus the testimonial network should be as an extremely close-knit community. Its close-knit nature is demonstrated by the exponentially decreasing degree distribution that decays more rapidly than power-law and has a definite cutoff. It also has an very clear assortative mixing pattern which is a ma jor feature of social networks. We conclude that the testimonial network is the closest of the four online social networks to real-life social networks. 8. CONCLUSIONS We have analyzed the complete network of an online social networking service, Cyworld. In addition, we have also analyzed sample networks from Cyworld, orkut, and MySpace in terms of degree distribution, clustering coefficient, degree correlation, and average path length. We report a multi-scaling behavior in Cyworld's degree distribution and have substantiated our claim that heterogeneous types of users are the force behind the behavior with detailed analysis of the clustering coefficient distribution, assortativity (or disassortativity), and the historical evolution of the network size, the average path length, and the effective diameter. The observation that the scaling exponents of MySpace and orkut match those from different regions in the Cyworld network is also worthy of note. 9. ACKNOWLEDGEMENTS This work was supported by grant No. R01-2005-0001112-0 from the Basic Research Program of the Korea Science & Engineering Foundation and SK communications, Inc. We thank Jeongsu Hong and Jaehyun Lim of SK communications, Inc. for providing the Cyworld data. 11. REFERENCES [1] Techweb. http://www.techweb.com. [2] G. Kossinets. Effects of missing data in so cial networks. Preprint, arXiv.org:cond-mat/0306335, 2003. [3] A.-L. Barabasi and R. Alb ert. Emergence of scaling in random networks. Science, 286:509, 1999. 844