Structural and Algorithmic Aspects of Massive Social Networks

Stephen Eubank, V.S. Anil Kumar, Madhav V. Marathe, Aravind Srinivasan, Nan Wang

Abstract

We study the algorithmic and structural properties of very large, realistic social contact networks. We consider the social network for the city of Portland, Oregon, USA, developed as a part of the TRANSIMS/EpiSims project at the Los Alamos National Laboratory. The most expressive social contact network is a bipartite graph, with two types of nodes: people and locations; edges represent people visiting locations on a typical day. Three types of results are presented. (i) Our empirical results show that many basic characteristics of the dataset are well-modeled by a random-graph approach suggested by Fan Chung Graham and Lincoln Lu (the CL-model), with a power-law degree distribution. (ii) We obtain fast approximation algorithms for computing basic structural properties such as clustering coefficients and the shortest-path distribution. We also study the dominating set problem for such networks; this problem arose in connection with optimal sensor-placement for disease-detection. We present a fast approximation algorithm for computing near-optimal dominating sets. (iii) Given the close approximations provided by the CL-model to our original dataset and the large data-volume, we investigate fast methods for generating such random graphs. We present methods that can generate such a random network in near-linear time, and show that these variants asymptotically share many key features of the CL-model, and also match the Portland social network. The structural results have been used to study the impact of policy decisions for controlling large-scale epidemics in urban environments.

Basic and Applied Simulation Science (CCS-5) and National Infrastructure and Analysis Center (NISAC), Los Alamos National Laboratory, MS M997, P.O. Box 1663, Los Alamos, NM 87545. Email: {eubank,anil,marathe}@lanl.gov.
This work is supported by the Department of Energy under Contract W-7405-ENG-36. Department of Computer Science and Institute for Advanced Computer Studies, University of Maryland, College Park, MD 20742. Email: srin@cs.umd.edu. Department of Computer Science, University of Maryland, College Park, MD 20742. Most of the work was done while visiting Los Alamos National Laboratory. Email: nwang@cs.umd.edu.

1 Introduction

The rapidly expanding population of today's towns and cities has resulted in very high population densities and social connectivity. Understanding the urban social-contact structure is critical for social scientists, urban planners, infrastructure companies, and governments. For example, the spatial distribution of population in a city, people's movement patterns, and their phone-calling patterns have a direct bearing on how wireline and wireless infrastructure providers design their networks. Similarly, the social contact network also determines the spread of an infectious disease. For instance, one of the astonishing features of the recent SARS outbreak was the speed and the extent to which the epidemic spread: this is a demonstration of how "connected" the world has become. For many societies, this raised the important questions of how to detect the disease quickly, and how to control it (by quarantining, vaccination, etc.). Because of the significant logistics involved, this needs to be done at as little cost as possible. This work is an empirical and analytical study of various structural and algorithmic issues of massive social networks. Our work is a part of the EpiSims project. EpiSims is a simulation-based analytical tool, being developed at the Los Alamos National Laboratory, to study the spread of infectious diseases in an urban environment [11]. In order to understand the spread of contagious diseases, we need a realistic representation of a social network.
The Transportation Analysis and Simulation System (TRANSIMS) developed at LANL produces estimates of the social network in a large urban area based on the assumption that the transportation infrastructure constrains people's choices about what activities to perform and where to perform them [15]. TRANSIMS produces locations of all travelers on a second-by-second basis; see [15] and http://transims.tsasa.lanl.gov for more information. EpiSims is a tool for simulating the spread of epidemics at the level of individuals in a large urban region, taking into account realistic contact patterns and disease transmission characteristics. It integrates models for disease propagation within a host, transmission between hosts, and contact patterns among hosts to create a realistic microscopic epidemic simulation. It provides estimates of both the geographic and demographic distribution of disease as a function of time. EpiSims provides a natural representation for assessing the capability of public health infrastructure to defend urban populations against disease outbreaks. Tools such as vaccination campaigns, quarantine and isolation, and contact tracing can be modeled easily within the simulation, even if they are demographically targeted (e.g., immunizing children). Structural analysis of social contact networks is an integral part of EpiSims and TRANSIMS.

Copyright © 2004 by the Association for Computing Machinery, Inc. and the Society for Industrial and Applied Mathematics. All rights reserved. Printed in the United States of America. No part of this book may be reproduced, stored, or transmitted in any manner without the written permission of the publisher. For information, write to the Association for Computing Machinery, 1515 Broadway, New York, NY 10036 and the Society for Industrial and Applied Mathematics, 3600 University City Science Center, Philadelphia, PA 19104-2688.
There are three key ingredients to develop: (i) new graph properties that serve as structural invariants and model a given policy/design question at hand; (ii) mathematical theories that explain the evolution of these graphs and also serve as models generating random ensembles of such networks; and (iii) fast computational methods for solving problems arising in (i) and (ii). For example, recent results have shown that many real-world social and infrastructure networks cannot be modeled by the Erdős-Rényi model G(n, p) [3]. New mathematical models have been proposed for generating generic "real-world networks". Recent work on structural analysis of large real-world networks started with experimental work on the structure of the Internet/WWW (see, e.g., [2, 7, 18]). One of the striking results of such studies was the power-law nature of the degree distribution. Since models such as G(n, p) do not have this property, these studies have sparked research on theoretical models to explain such properties. These models are broadly of two types: (i) incremental models for constructing such graphs, using primitives such as preferential attachment (see, e.g., [5, 6, 9, 20, 21]): each new vertex chooses its set of neighbors randomly according to some function of the degrees of previous vertices; and (ii) models that assume a given degree sequence and generate random graphs, either by putting in each edge independently with a probability dependent on the degrees of the end-points [8], or by generating a graph uniformly at random from the space of graphs with the given degree-sequence [22]. Structural properties of networks that have become classical include: degree distribution, clustering-coefficient distribution, shortest-path distribution, and connected components. Most of the work on the above models has been toward understanding such structural properties [9, 6, 8, 22]. They serve as important invariants, much like physical laws that serve as invariants for physical systems.
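To make the first (incremental) type of model concrete, here is a minimal preferential-attachment sketch; it illustrates the general primitive only, not the specific models of [5, 6, 9, 20, 21], and the parameters n, m, and seed are arbitrary:

```python
import random

def preferential_attachment(n, m, seed=None):
    """Grow a graph on n nodes; each new node attaches to m distinct
    existing nodes chosen with probability proportional to their current
    degree.  Returns the edge list.

    Uses the standard trick of sampling from a list that repeats each
    endpoint once per incident edge."""
    rng = random.Random(seed)
    edges = []
    endpoints = list(range(m))  # seed nodes, sampled uniformly at first
    for v in range(m, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(endpoints))
        for t in targets:
            edges.append((v, t))
            endpoints += [v, t]
    return edges
```

Repeating endpoints once per incident edge is what makes a uniform draw from `endpoints` proportional to degree; high-degree nodes appear more often in the list and so attract new edges faster, which is exactly the rich-get-richer effect behind power-law degree distributions.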
In addition to such structural results, there has been much work in classical random graph theory on algorithms for problems that are usually NP-hard in the worst case [17]. (See http://cnls.lanl.gov/networks for a recent workshop on this and related topics.) No such studies seem to have been carried out yet for random graph models for social networks. Combinatorial problems such as dominating sets, cuts, and flows are useful in this setting. One challenge here is that these graphs are typically very large (of the order of a million nodes or more), since they represent models for social interaction, and computing even simple quantities on such massive graphs is nontrivial. This raises the need for very fast algorithms. Another important problem that has not received attention in this context so far is that of efficient generation of these graphs. For instance, in the model of [8] (the CL-model), implementation requires quadratic time, even when the average degree is small. For very large graphs with millions of nodes, a fast generation mechanism that takes time of the order of the number of edges, but still produces graphs with meaningful structure, would be very useful (e.g., for sensitivity analysis). The starting point of our work is a realistic social contact structure based on data from Portland, Oregon, USA. The main component of this dataset that is relevant to us is a social contact network G(P ∪ L, E), which is a bipartite graph with a set P of people and a set L of locations. An edge (p, ℓ) is present if person p visits location ℓ on a typical day. The dataset is massive: |P| ≈ 1.6 million, |L| ≈ 1.8 × 10^5, and |E| ≈ 6 million. Our contributions are threefold.

(a) Fast Generation of Network Models. One key idea of the CL-model for massive graphs [8] is that each node u has a given "weight" d(u) corresponding to its expected degree, such that nodes tend to attract edges with probability proportional to their weight.
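Concretely, in the CL-model each edge (p, ℓ) is put in independently with probability d(p)·d(ℓ)/Δ, where Δ is the common weight sum; a naive quadratic sampler (the dict-of-weights representation is an illustrative choice) might look like:

```python
import random

def cl_bipartite_naive(people_weights, loc_weights, seed=None):
    """Naive O(|P|*|L|) sampler for the CL-model bipartite graph.

    people_weights / loc_weights: dicts mapping node -> expected degree
    (weight); the two weight sums must agree.  Each edge (p, l) is
    included independently with probability d(p)*d(l)/Delta."""
    rng = random.Random(seed)
    delta = sum(people_weights.values())
    assert abs(delta - sum(loc_weights.values())) < 1e-9
    edges = []
    for p, dp in people_weights.items():       # the |P| * |L| loop that
        for l, dl in loc_weights.items():      # FastGen is designed to avoid
            if rng.random() < dp * dl / delta:
                edges.append((p, l))
    return edges
```

By linearity of expectation, each node's expected degree equals its weight; the double loop is exactly the Θ(|P|·|L|) cost that the fast-generation variants of Section 3 remove.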
(In the WWW context, one such situation is where popular sites tend to get even more inward links over time.) In our context, the CL-model would take the sets {d(p) : p ∈ P} and {d(ℓ) : ℓ ∈ L} as input, where Δ = Σ_{p∈P} d(p) = Σ_{ℓ∈L} d(ℓ), and independently put in the edge (p, ℓ) with probability d(p) · d(ℓ)/Δ; note that the expected degree of any node in P ∪ L then equals its given weight. Such a weight-based model seems plausible in the social-network case also: a popular tourist spot or massive office structure will likely get even more visitors, and a person such as a bus driver making long trips is likely to visit new locations (within limits). We discuss the empirical validity of this model in (c) below. Note that this graph-generation takes time Θ(|P| · |L|), which is infeasible in our case: such models typically have to be generated independently many times, e.g., for simulations. Thus, we first present two natural variants of the CL-model, FastGen-1 and FastGen-2, and show how they can be implemented in time nearly linear in the number of edges. We show analytically that FastGen-1 asymptotically matches the CL-model for various parameters of interest, and give reasons for why FastGen-2 may not. The empirical validity of these models is discussed in (c). (For diseases such as smallpox, L can be restricted to buildings; the contact times in city public transport are usually inadequate.)

(b) Fast Algorithms. Given the outbreak of a disease, one way of detecting it is to place sensors at various locations (e.g., entrances to schools and malls) so that any infected individual can be detected. Since placing and monitoring sensors is an expensive task, a policy planner would like to place as few sensors as possible, but still detect the disease (or at least a vast majority of the non-quarantined infected individuals). This is the minimum dominating set problem: find a smallest subset of L so that each person, or at least a (1 − ε)-fraction of P, visits one such location.
We are interested in the fast computation of a near-optimal-sized L' ⊆ L that dominates P, or at least a (1 − ε)-fraction of P (we call this the (1 − ε)-dominating set problem), for a given constant ε. Dominating set cannot be approximated to within (1 − o(1)) ln |P| [19]; even the standard greedy algorithm, which takes nearly quadratic time, is infeasible for our large networks. For the contact graph in EpiSims, we show that the following FastGreedy algorithm gives a very good approximation to the (1 − ε)-dominating set problem: select the locations in decreasing order of degrees until a (1 − ε)-fraction of the people are dominated. It turns out that this heuristic works well for us because of the high overlap ratios of the locations (defined in Section 2; high values imply low overlaps). For our data and our models, FastGreedy takes a few seconds, as compared to a few hours for the standard greedy algorithm. We show that FastGreedy gives (1 + o(1))-approximations (to the (1 − ε)-dominating set problem) for the CL-model and for our FastGen-1 model. We also show that for these models, an even simpler sampling-based algorithm is good: sample each person independently, and choose a subset of locations visited by this sampled set having a certain bound on the sum of degrees. This is really practical for a policy planner, because we only need to know the degrees with respect to the sample. We also need fast algorithms for computing various parameters of our graphs, e.g., clustering coefficients and the shortest-path distribution. It is simple to show that sampling a "small" number of vertices yields the former with high accuracy. For the latter, we show the following.
Given an arbitrary graph G = (V, E) with |V| = n and a parameter ε, suppose we aim to estimate with high accuracy the number SPD(i) of pairs (u, v) that are at distance i from each other, only for those i for which SPD(i) ≥ ε · n² (e.g., in practice, those i that occupy less than 0.1% of the shortest-path distribution may be irrelevant). We show that a small number of single-source shortest-path computations, run from randomly generated sources, is sufficient.

(c) Empirical Evaluations. We carry out an extensive empirical analysis of our results, to show that: (i) the algorithms for computing structural properties and generating graphs run very fast and provide very good approximations, and (ii) the CL-model and the fast-generation variant(s) provide good approximations to the real social network for most of the graph parameters studied here. Thus, the CL-model and its variants can be valuable for carrying out sensitivity analysis of our methods as they relate to policy implications. The results presented here have been combined with the simulations in EpiSims in two recently concluded studies done by Los Alamos for the Dept. of Homeland Security. Recently, there has been an active debate over adopting response strategies for controlling smallpox [14, 4, 12]. The results in [13] show the importance of early detection and aggressive quarantining/vaccination to control the spread of diseases in cities such as Portland. See [4, 12] and our website http://www.ccs.lanl.gov/ccs5/index.shtml.

2 Preliminaries

We will work with the bipartite graph G(P ∪ L, E) defined in the introduction, as well as some related graphs. As pointed out in Section 5, the degree distribution of L is well-approximated by a power law with exponent β > 2: i.e., the number of locations in L with degree i is close to n_i = c|L|/i^β, where c is a normalization constant. In the actual dataset, β ≈ 2.8; we will work with an arbitrary constant β > 2.
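A power law with exponent β > 2 has a finite mean, a fact used in the edge-counting identity (2.1) below; a quick numerical sanity check, assuming a continuous power law (the minimum value 10.0 and the sample size are illustrative choices, not parameters of the dataset):

```python
import random

def powerlaw_sample(dmin, beta, rng):
    """Draw from a continuous power law with density proportional to
    x^(-beta) on [dmin, infinity), via inverse-transform sampling."""
    return dmin * (1.0 - rng.random()) ** (-1.0 / (beta - 1.0))

# For beta > 2 the mean is dmin * (beta - 1) / (beta - 2); this is the
# per-location factor appearing in the edge-counting identity (2.1).
rng = random.Random(42)
beta, dmin, n = 2.8, 10.0, 100000
empirical = sum(powerlaw_sample(dmin, beta, rng) for _ in range(n)) / n
predicted = dmin * (beta - 1) / (beta - 2)
```

With β = 2.8 the predicted mean is 22.5; a large sample should agree to within a few percent, though convergence is slow because the variance is infinite for β < 3.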
We let d0 and d1 denote the minimum and maximum location degrees, respectively; we will have d0 ≪ d1, i.e., the location degrees exhibit large variations. The people-degrees are small, as expected, since a person cannot visit too many locations on a typical day; they are sharply concentrated around their average value, which will throughout be denoted wp (and is approximately 4 in our dataset). Counting the number of edges from P and from L, we get

(2.1)  |P| · wp ≈ ((β − 1)/(β − 2)) · |L| · d0.

In urban settings, it is reasonable to assume that |L| grows at a much slower rate than |P| (after a city becomes larger than some critical size, one starts having lots of high-rise locations). Since wp and β > 2 are small constants, (2.1) then implies that d0 must be large. So we will assume for our mathematical models that d0 = ω(1): i.e., a function of |P| that increases unboundedly. (A typical choice could be some not-too-fast-growing function, such as polylog(|P|).) We will also consider two other graphs induced by G. The people-people network GP is the projection of G onto P: the nodes of GP are P, and (p, p') is an edge if there is a common location ℓ ∈ L such that (p, ℓ), (p', ℓ) ∈ E(G). GP is a key network since it corresponds to contact between people, and hence represents the spread of epidemics through personal contact. The location-location graph GL is defined similarly. When developing random graph models of G(P ∪ L, E) (related to the CL-model) given by the degree-sequences of P and L, we will make the assumptions of the above paragraph, for some arbitrary constant β > 2. Recall from Section 1 that the CL-model will take such a degree-sequence and then generate a random graph with edges put in independently, with probability proportional to the product of the expected degrees. For a node v ∈ P ∪ L, N(v) is the set of nodes adjacent to v. For a set V ⊆ P ∪ L, N(V) is defined as N(V) = ∪_{v∈V} N(v); also, d(V) is defined to be d(V) = Σ_{v∈V} d(v).
For a set L' ⊆ L, the overlap ratio of L' is defined to be |N(L')| divided by the sum of degrees of the nodes in L'. The overlap ratio is a quantity in (0, 1], and a larger value indicates lower overlap. The clustering coefficient [25] is a measure popular with social scientists for studying interaction graphs. The clustering coefficient C(v) for node v is defined to be the number of edges (u, w) such that u, w ∈ N(v), divided by C(deg(v), 2) = deg(v)(deg(v) − 1)/2, where deg(v) denotes the degree of v. For any integer i, C(i) is the average clustering coefficient of vertices of degree i. The clustering coefficient is 0 for every node in the bipartite graph, and therefore we consider it only for GP.

3 Fast Generation of our Massive Graphs

As explained in Section 1, a naive generation of random instances using the model of [8] takes |P| · |L| steps, because we have to try out each tuple (p, ℓ) ∈ P × L. The number of edges, on the other hand, is O(|P| · wp), which is essentially O(|P|). This is a lot smaller than |P| · |L| for the scale of data sets we are looking at, and naturally leads to the question of whether instances can be generated in nearly O(|P| · wp) time. We describe a method to do this in this section.

The Fast-Generation Models. Formally, we are given two disjoint sets P and L, and two positive integral degree sequences D(P) = {d(p) : p ∈ P} and D(L) = {d(ℓ) : ℓ ∈ L} such that Δ = Σ_{p∈P} d(p) = Σ_{ℓ∈L} d(ℓ) and max_{p∈P} d(p) · max_{ℓ∈L} d(ℓ) ≤ Δ. Let the random variable X_{p,ℓ} denote the event that there is an edge between p ∈ P and ℓ ∈ L. FastGen-1 will be the following model, where the sequence of location-degrees equals D(L) with probability 1. We will guarantee the following properties, by a careful implementation of the approach of [24]:
(a) for all p ∈ P and ℓ ∈ L, Pr[X_{p,ℓ} = 1] = d(p) · d(ℓ)/Δ;
(b) for all ℓ ∈ L, Pr[|{p : X_{p,ℓ} = 1}| = d(ℓ)] = 1;
(c) for all ℓ ∈ L, the following "negative correlation" properties hold for all subsets P' ⊆ P:
Pr[∧_{p∈P'} (X_{p,ℓ} = 0)] ≤ ∏_{p∈P'} Pr[X_{p,ℓ} = 0], and Pr[∧_{p∈P'} (X_{p,ℓ} = 1)] ≤ ∏_{p∈P'} Pr[X_{p,ℓ} = 1].
The first property implies that each edge is put in with the right probability. The second property ensures that the degrees of nodes in L are equal to the expected degrees. (Our algorithm can be trivially modified so that the degrees in P are satisfied exactly, instead: that model is called FastGen-2.) The third property allows us to use Chernoff-like bounds, and we will use it later to show that several measures like overlap ratios, clustering coefficients, etc. are preserved in this model.

The FastGen Algorithm. The discussion here is for model FastGen-1. Algorithm FastGen is a careful implementation of the approach of [24] and is described in Figure 1.

Algorithm FastGen
1. Order the elements of P in an arbitrary order (p1, p2, ...).
2. Compute the prefix sums ps_i = Σ_{j=1}^{i} d(pj) for 1 ≤ i ≤ |P|, and define ps_0 = 0.
3. Partition the elements of P into a sequence of z blocks block_1, block_2, ..., block_z satisfying the following properties.
   (a) Each block consists of a (contiguous) subsequence of P, e.g., block_k = (p_{j_k}, p_{j_k + 1}, ..., p_{j_{k+1} − 1}).
   (b) Let w(block_k) = Σ_{p_j ∈ block_k} d(p_j). The last block satisfies 0 < w(block_z) · max_{ℓ∈L} d(ℓ) ≤ Δ.
   (c) For each k < z (where p_{j_k} denotes the first element of block_k), the k-th block satisfies
   (3.2)  Δ − d(p_{j_{k+1}}) · max_{ℓ∈L} d(ℓ) < w(block_k) · max_{ℓ∈L} d(ℓ) ≤ Δ.
4. For each ℓ ∈ L, choose the d(ℓ) edges incident on ℓ by the following steps.
   (a) Let q = ⌊max_{ℓ'∈L} d(ℓ')/d(ℓ)⌋. Group the blocks into super-blocks: Block_1 = (block_1, block_2, ..., block_q), Block_2 = (block_{q+1}, block_{q+2}, ..., block_{2q}), and so on. By Lemma 3.2, there are at most 2(d(ℓ) + 1) super-blocks.
   (b) Let the random variable Y_i denote the event that there is an edge between ℓ and the super-block Block_i. Run the algorithm of [24] to determine whether Y_i = 1 or 0.
   (c) Suppose Block_i = (p_{t_i}, p_{t_i + 1}, ..., p_{t_i + j}), a subsequence of the ordered people. If Y_i = 1, choose one p ∈ Block_i with probability proportional to its weight d(p), and add the edge (ℓ, p). This can be done by first choosing an integer r uniformly at random from the interval [ps_{t_i − 1}, ps_{t_i + j}), and then locating p_x ∈ Block_i by running binary search on the (increasing) sequence (ps_{t_i}, ps_{t_i + 1}, ..., ps_{t_i + j}) to find the index x such that ps_{x−1} ≤ r < ps_x.

Figure 1: Description of the algorithm for fast generation of random graphs.

Lemma 3.1. For any positive integer q and any index k, q/2 < (Σ_{i=0}^{q−1} w(block_{k+i}) · max_{ℓ∈L} d(ℓ))/Δ ≤ q. Also, the total number of blocks is at most 2(max_{ℓ∈L} d(ℓ) + 1).

Lemma 3.2. For each ℓ ∈ L, the number of super-blocks in step 4(a) of FastGen is at most 2(d(ℓ) + 1).

Lemma 3.3. Algorithm FastGen satisfies the requirements of the fast-generation model and runs in time O(Δ · log |P|).

3.1 Analytical Validation of FastGen-1. We now briefly explain why model FastGen-1 is a close approximation to that of [8] for the degree distribution of the people-people graph, clustering coefficient, overlap ratios, etc. Similar remarks also hold for the distribution of shortest-path lengths, and we will provide detailed proofs in the full version.

Consider the degree distribution of the people-people graph. Recall that a general model we are employing for the bipartite graph (P, L) is one where: (a) the weight of any person is bounded by a constant C, and (b) the weights of the locations follow a power law with exponent β > 2, with weights in the range [d0, d1], where d0 is a slowly growing function of |P| such as polylog(|P|). Since the number of locations with degree d1 must be nonzero, it can be shown using the definition of a power law that d1 ≤ O(d0 · |L|^{1/β}) ≤ O(|P|^{1/2 − Ω(1)}). Consider a person p. Its number of neighbors is distributed asymptotically as Poisson(d(p)), both in the model of [8] and in FastGen-1. Since d(p) ≤ C, we have in particular that with high probability, the number of neighbors is at most 2 ln |P|/(ln ln |P|).
So, since each of these neighbors has degree at most |P|^{1/2 − Ω(1)} as seen above, the total degree of these neighbors is at most |P|^{1/2 − Ω(1)}; thus, these neighboring locations of p will, in turn, have almost-disjoint neighborhoods in the set P, with high probability. Thus, the number of neighbors of p in the people-people graph is essentially the sum of the degrees of its neighbors, measured in the graph (P, L). This sum is sharply concentrated around its mean: in the model of [8] it is a sum of independent random variables with mean ω(1), and in FastGen-1 the concentration holds trivially with probability 1 (since the location-degrees equal the prespecified degrees with probability 1). A similar argument holds for the overlap ratios. As for the clustering coefficient of the people-people graph, recall that it equals (Σ_{p∈P} C(p))/|P|, where C(p) is the number of edges NE(p) in the neighborhood N(p) of p in the people-people graph, divided by C(|N(p)|, 2). Consider the random variable C(p) for any fixed p, in both of our models of interest. As sketched above, the denominator C(|N(p)|, 2) is highly concentrated around a value a; thus, E[C(p)] is essentially (1/a) · E[NE(p)]. Now, it can be shown that the main type of conditioning in the calculation of E[NE(p)] involves conditioning on a small constant number of edges incident on each location in a set L' ⊆ L. This conditioning poses no problems in the CL-model, due to its full independence. It essentially poses no problem in model FastGen-1 either, since the locations make their edge-choices independently and have expected degree ω(1). It can also be shown that FastGen-2 is in general not a good approximation to [8] in the context of the above-seen parameters, basically because the weights d(p) of the people are all small. In particular, conditional on the presence of an edge (p, ℓ) in (P, L), the distribution of other edges incident on p becomes altered significantly in FastGen-2.
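Step 4(c) of Algorithm FastGen reduces weighted selection within a super-block to prefix sums plus binary search; a standalone sketch of just that selection step (an illustrative helper, not the full FastGen implementation):

```python
import bisect
import random

def pick_weighted(people, weights, rng):
    """Choose one person with probability proportional to its (positive,
    integral) weight, via prefix sums and binary search, as in step 4(c)
    of FastGen: O(|block|) preprocessing, O(log |block|) per draw."""
    prefix = []
    total = 0
    for w in weights:
        total += w
        prefix.append(total)             # prefix[i] = w_0 + ... + w_i
    r = rng.randrange(total)             # uniform integer in [0, total)
    x = bisect.bisect_right(prefix, r)   # first index with prefix[x] > r
    return people[x]
```

In the real algorithm the prefix sums are computed once for all of P (step 2), so each draw inside a super-block is just one random integer and one binary search over the relevant subsequence.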
We will present the details in the full version.

4 Fast Algorithms

4.1 Dominating Set. The FastGreedy heuristic is the following simple algorithm for computing a dominating set. Consider the locations in L in nonincreasing order of d(·): ℓ1, ℓ2, ..., ℓ|L|, with d(ℓ1) ≥ d(ℓ2) ≥ ... ≥ d(ℓ|L|), where d(·) is the given degree-sequence. Pick the smallest i such that |∪_{j=1}^{i} N(ℓj)| ≥ (1 − ε)|P|, and take the subset {ℓ1, ..., ℓi} to be the dominating set (for dominating a (1 − ε)-fraction of the people). For any L' ⊆ L, define d(L') = Σ_{ℓ∈L'} d(ℓ). Consider first the case where d(p) is the same for all p ∈ P (recall that the people-degrees are highly concentrated around the mean). A key lemma that we can show for both the CL-model and FastGen-1 is:

Lemma 4.1. There is a certain parameter γ = o(1) such that the following holds simultaneously for all L' ⊆ L with high probability: if d(L') ≤ |P| · ln(1/ε)/(1 + γ), then |N(L')| < (1 − ε) · |P|, and if d(L') ≥ (1 + γ) · |P| · ln(1/ε), then |N(L')| ≥ (1 − ε) · |P|.

Using this, we can show that FastGreedy is a (1 + o(1))-approximation algorithm for the (1 − ε)-dominating set problem; by a similar argument, we get that it is an O(1)-approximation algorithm (with a small constant hidden in the O(1)) when the people-degrees vary somewhat.

Domination Using Sampling. FastGreedy and the usual greedy algorithm require knowing the structure of the graph and the degrees of locations. Once this data is available, these algorithms are easy to implement, but collecting such data is a non-trivial task. In practice, any simulation (such as EpiSims) builds models based on a small sample of data from the whole population. This motivates the following question: can one get a good solution to the domination problem by only sampling a fraction of the data? We show here that this is possible, using the following SampledGreedy algorithm. Choose a set S ⊆ P by sampling each person independently with probability π. Compute the set N(S) of locations visited by S, and choose the smallest set L' ⊆ N(S) whose sum of degrees (restricted to S) is at least π|P| · ln(1/ε). L' is output as the solution for the whole graph.

Lemma 4.2. Assume d0 ≥ a log |P|, where a is a certain constant. For any sampling probability π ≥ 2(ln |P|)/d0, the solution L' chosen by the SampledGreedy algorithm is also a (1 + o(1))-approximate solution to the (1 − ε)-dominating set problem on the whole graph, with high probability.

4.2 Shortest Path and Clustering Coefficient Distribution. Two basic statistics related to the people-people graph GP are the average clustering coefficient (c.c.), (Σ_{p∈P} C(p))/|P|, and the distribution of shortest paths. The former is easy to estimate by standard sampling, since each C(p) lies in [0, 1]: choosing Θ(log |P|) vertices at random and outputting their mean c.c. is easily shown, via a Chernoff bound, to give an accurate estimate. For shortest paths, consider estimating the values SPD(·) defined in Section 1 for the graph GP. Suppose we only wish to do so (with some relative error δ) for all i with SPD(i) ≥ ε|P|². Simple sampling works again: choose vertex-pairs in an i.i.d. manner, so that the expected number of choices is of the form Θ(δ^{-2} ε^{-1} · ln(1/ε)); find the distance between each chosen pair, and output the distribution obtained. A similar sample complexity, using single-source shortest paths from random sources, can be shown using a tail bound from [23].

5 Experimental Analysis and Validation

We now present empirical results to strengthen the claims in Section 3.1. We also compare the results of the random graph models with the original social contact network for Portland; in this section we denote this network as the Real-Network. Recall that the degree sequence of this network is used as input by the random graph models. All the experiments were done on a shared machine, a Sun UltraSPARC-III with a 750 MHz CPU and 8.0 GB of main memory.
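For reference, the FastGreedy heuristic of Section 4.1, whose empirical quality is measured below, can be sketched as follows (the set-valued location-to-people map is an illustrative representation):

```python
def fast_greedy(loc_neighbors, num_people, eps):
    """FastGreedy for (1-eps)-domination: scan locations in nonincreasing
    degree order, keeping each one until at least a (1-eps)-fraction of
    the people is covered.

    loc_neighbors: dict location -> set of people visiting it."""
    target = (1.0 - eps) * num_people
    covered = set()
    chosen = []
    for loc in sorted(loc_neighbors,
                      key=lambda l: len(loc_neighbors[l]), reverse=True):
        if len(covered) >= target:
            break
        chosen.append(loc)
        covered |= loc_neighbors[loc]
    return chosen
```

Unlike the standard greedy algorithm, the order is fixed up front by degree alone (no re-scoring of marginal coverage), which is why a single sort plus one pass suffices; the high overlap ratios of Section 2 are what make this degree order nearly as good as true greedy on these networks.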
In all our experiments, we ignored isolated vertices, i.e., vertices with degree 0. For example, in Table 1, the percentages of locations and people for the three models are the percentages of non-isolated locations and people in the generated graph, compared to the Real-Network. In all the random graphs that we generate, there is a giant connected component consisting of almost all vertices. By projecting the people-location bipartite graph onto the people-people graph and the location-location graph, we examine the sizes of the giant components in these two graphs separately. Since the CL-model is expensive to run, we only generated a few instances of it. For all three random graph models we found that the values are sharply concentrated; hence, instead of taking averages, the data in Table 1 is for only one instance from each model. We study the following measures for these random graph models and the EpiSims data: (i) degree distribution in the bipartite graph and in the induced people-people graph; (ii) overlap ratios; (iii) size of the giant component in the people-people graph and the location-location graph; (iv) shortest paths and clustering coefficients; (v) quality of the FastGreedy algorithm for domination. Table 1 compares some basic parameters between the Real-Network and the graphs generated by the random models. The size of the giant component indicates the number of people (locations) in the giant connected component of the people-people (location-location) graph. Note that FastGen-1 preserves the number of locations and FastGen-2 preserves the number of people; both of them preserve the number of edges.

(a) Degree Distributions. Figure 2 compares the degree distributions of locations in the Real-Network and the random graphs generated by the CL- and FastGen-2 models. One can see from the figure that the three distributions are very close to each other. Note that FastGen-1 preserves location degrees.
Also note that a large part of the degree distribution of locations exhibits a power law. This part starts at degree approximately 20 and ends at degree about 200: for locations of degree k in [20, 200], the number of them is proportional to |L|/k^β, where β ≈ 2.8, i.e., |L_k| ≈ c|L|/k^{2.8}, where L_k = {ℓ ∈ L : deg(ℓ) = k}, 20 ≤ k ≤ 200, and c ≈ 200. Next consider the degree distribution of the giant component of the people-people graph GP. The experimental results and the theoretical analysis (Section 3.1) together confirm that the CL- and FastGen-1 models match the Real-Network much more closely than FastGen-2 does.

(b) Overlap Ratios. We now study two variants of overlap ratios for locations in the bipartite graph. For any positive integer d, the point overlap ratio(d) is the overlap ratio for the set L' = {ℓ ∈ L : d(ℓ) = d}, and the cumulative overlap ratio(d) is the overlap ratio of the set L' = {ℓ ∈ L : d(ℓ) ≥ d}. Note that all the overlap ratios are in (0, 1]; the higher the overlap ratio, the better the FastGreedy algorithm performs for the dominating set problem. The results are shown in Figure 3. From both the experiments (Figure 3) and the theoretical analysis (Section 3.1), we can conclude that the overlap ratios of graphs generated by FastGen-1 are close to those of graphs generated by the CL-model, and both are close to the Real-Network. On the other hand, the overlap ratios of graphs generated by FastGen-2 are much higher than the corresponding overlap ratios in graphs generated by the other two models and in the Real-Network. Thus we can expect FastGreedy to perform much better on graphs generated by FastGen-2 than on the others.

(c) Shortest Paths and Clustering Coefficients. Two popular measures that people study on social networks are the shortest-path distribution and the clustering-coefficient distribution. Often, these give important connectivity information for the contact networks.
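A single-source BFS estimator for the shortest-path distribution, in the spirit of the sampling experiments in (c1) below, can be sketched as follows (the adjacency-dict representation of the unweighted people-people graph is an illustrative choice):

```python
import random
from collections import Counter, deque

def estimate_spd(adj, num_sources, seed=None):
    """Estimate the shortest-path-length distribution of an unweighted
    graph by running BFS from a few randomly chosen sources.

    adj: dict node -> list of neighbors.  Returns a Counter mapping
    distance i -> estimated fraction of (source, target) pairs at
    distance i, over pairs reachable from the sampled sources."""
    rng = random.Random(seed)
    nodes = list(adj)
    sources = rng.sample(nodes, min(num_sources, len(nodes)))
    counts = Counter()
    for s in sources:
        dist = {s: 0}
        queue = deque([s])
        while queue:                      # standard BFS from source s
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for d in dist.values():
            if d > 0:
                counts[d] += 1
    total = sum(counts.values())
    return Counter({d: c / total for d, c in counts.items()})
```

Each BFS costs O(|E|) time, so a constant (or logarithmic) number of sources is far cheaper than the Θ(|P|²) all-pairs computation, while still estimating accurately any distance class that holds a constant fraction of all pairs.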
Both of these can be computed in polynomial time, but the usual algorithms take Omega(|P|^2) time, which is infeasible for such large graphs. This motivates faster methods to approximate these distributions. We present simple random-sampling-based algorithms for both problems, and report on their empirical performance. For practical reasons, we examine the shortest paths and the clustering coefficients only in the giant component of the people-people graph. (c1) Shortest paths. Since log_2 |P| is approximately 20, we did seven experiments, each uniformly and independently sampling about a hundred vertices (i.e., people) in the EpiSims graph and computing the shortest-path spanning tree for each sampled vertex. The fraction of shortest paths of distance i in the giant component is estimated by the fraction of paths of distance i in all the shortest-path spanning trees. We performed the same experiments for the CL-model, FastGen-1 and FastGen-2: for each model, nine experiments were done, and in each experiment about a hundred vertices were sampled. The means and standard deviations of the experiments for the EpiSims data and the three models are presented in Table 2. From the table we can see that the distance between most pairs of people in the giant connected component is 2 or 3. (c2) Clustering coefficients. For the distribution of the clustering coefficients, we examine, for each degree i >= 2, the average clustering coefficient C(i) in the giant component of the people-people graph. We present two fast algorithms to approximate the distribution of the clustering coefficients. Fast-algorithm-1 provides a lower bound on the distribution of the clustering coefficients. For a person p in P, arbitrarily order the locations in N(p) as l_1, ..., l_d, where d = |N(p)| is the degree of p in the bipartite graph. Let N^(l_1) = N(l_1) \ {p}, and for i = 2 to d, let N^(l_i) = N(l_i) \ ({p} union N^(l_1) union ... union N^(l_{i-1})).
The lower bound on the clustering coefficient of p is

    [ sum_{i=1..d} |N^(l_i)| * (|N^(l_i)| - 1) ] / [ (|N(N(p))| - 1) * (|N(N(p))| - 2) ],

where N(N(p)) denotes the set of neighbors (in people) of the set N(p) (in locations). Due to the high overlap ratio of the set N(p) within L, we can approximate the people-people degree of p by |E(N(p))| - |N(p)|, where E(N(p)) is the set of edges incident on the set N(p). Fast-algorithm-2 exploits these two properties.

                               Real-Network   CL                 FastGen-1          FastGen-2
  number of locations          181230         178746 (98.63%)    181230 (100%)      178668 (98.59%)
  size of giant (locations)    181192         178571             181088             178611
  number of people             1615860        1507234 (93.28%)   1507291 (93.28%)   1615860 (100%)
  size of giant (people)       1615813        1507054            1507148            1615803
  number of edges              6060679        6065637 (100.08%)  6060679 (100%)     6060679 (100%)
  average degree (people)      3.7507         4.0227             4.0209             3.7507
  time for generating a graph  --             > 10 hours         < 40 seconds       < 30 seconds

Table 1: Comparing the basic graph-theoretic parameters of the Real-Network and the randomly generated graphs. The last row compares the time required to generate such graphs.

Figure 2: Degree distribution of locations (log-log plots: fraction of locations with a given degree vs. location degree, comparing EpiSims with the CL-model, FastGen-1, and FastGen-2).
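Fast-algorithm-1 as described above admits a direct single-pass implementation per person. The sketch below is illustrative rather than the paper's code; the data layout (dicts mapping people to their locations and locations to their people) and all names are ours.

```python
def clustering_lower_bound(p, people_locs, loc_people):
    """Sketch of Fast-algorithm-1: for person p, charge each visited
    location l_i only the people not already seen at l_1..l_{i-1}
    (the disjoint sets N^(l_i)).  Since each location induces a clique
    among its visitors, this undercounts edges among p's neighbors and
    so lower-bounds p's clustering coefficient."""
    pair_count = 0        # twice the guaranteed edge count
    seen = set()          # union of the N^(l_j) processed so far
    neighbors = set()     # N(N(p)) \ {p}
    for l in people_locs[p]:
        nl = set(loc_people[l]) - {p}
        neighbors |= nl
        fresh = nl - seen               # N^(l_i)
        pair_count += len(fresh) * (len(fresh) - 1)
        seen |= fresh
    deg = len(neighbors)  # = |N(N(p))| - 1, p's people-people degree
    if deg < 2:
        return 0.0
    return pair_count / (deg * (deg - 1))
```

When p's locations share no visitors, the bound is tight; overlap between locations is exactly what the bound gives up in exchange for avoiding the quadratic neighbor-pair enumeration.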
Figure 3: Point overlap ratios (log-log) of the Real-Network and the three models (EpiSims vs. CL-model, FastGen-1, and FastGen-2; point overlap ratio vs. location degree).

                      len=1           len=2    len=3    len=4    len=5           len=6
  Real-Network  mean  8.0806 x 10^-4  0.2666   0.7126   0.0200   4.8687 x 10^-5  5.6341 x 10^-9
                std   9.1226 x 10^-5  0.0149   0.0138   0.0049   8.8549 x 10^-5  9.8875 x 10^-9
  CL            mean  9.7835 x 10^-4  0.4264   0.5665   0.0061   1.0460 x 10^-6  --
                std   1.1151 x 10^-4  0.0241   0.0223   0.0025   5.1511 x 10^-7  --
  FastGen-1     mean  0.0010          0.4161   0.5756   0.0072   4.6721 x 10^-7  --
                std   1.5816 x 10^-4  0.0300   0.0282   0.0028   2.0310 x 10^-7  --
  FastGen-2     mean  8.4945 x 10^-4  0.3622   0.6344   0.0026   2.9173 x 10^-7  --
                std   6.9460 x 10^-5  0.0177   0.0181   0.0019   5.3671 x 10^-7  --

Table 2: Means and standard deviations of the fractions of shortest paths of different lengths in the single-source shortest-path spanning trees sampled from the giant component of the people-people graphs.
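The sampling scheme behind Table 2 (uniform random source vertices, one BFS tree per source, path-length fractions aggregated over all trees) can be sketched as follows. The adjacency-dict representation and function name are ours.

```python
import random
from collections import deque

def sample_distance_distribution(adj, num_sources=100, seed=0):
    """Estimate the distribution of shortest-path lengths in an
    (unweighted) connected graph by running BFS from a uniform random
    sample of source vertices and pooling the tree distances."""
    rng = random.Random(seed)
    nodes = list(adj)
    sources = rng.sample(nodes, min(num_sources, len(nodes)))
    counts, total = {}, 0
    for s in sources:
        dist = {s: 0}
        q = deque([s])
        while q:                      # standard BFS from s
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        for v, d in dist.items():
            if d > 0:                 # skip the source itself
                counts[d] = counts.get(d, 0) + 1
                total += 1
    return {d: c / total for d, c in counts.items()}
```

Each BFS costs O(|V| + |E|), so roughly a hundred sources give an estimate in time near-linear in the graph size, versus the Omega(|P|^2) cost of the exact distribution.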
Figure 4: Performance of the two fast greedy algorithms (FastGreedy-1 and FastGreedy-2) on the Real-Network and the three models: fraction of dominating locations vs. fraction of dominated people.

Fast-algorithm-2 approximates the lower bound given in Fast-algorithm-1. Instead of computing the lower bound on the clustering coefficient of p, we approximate it by

    [ sum_{i=1..d} (|N(l_i)| - 1) * (|N(l_i)| - 2) ] / [ (|E(N(p))| - |N(p)|) * (|E(N(p))| - |N(p)| - 1) ].

Since it computes neither the sets N^(l_i) nor N(N(p)), this algorithm is much faster than the first one. Both algorithms yield a very good approximation to the distribution of the clustering coefficients in the people-people graph. We also estimate the average clustering coefficient (c.c.) of the giant component of the people-people graph by sampling about a hundred vertices in each experiment. We did seven experiments each for the Real-Network and for the graphs generated by the three models, and calculated the mean and standard deviation over the seven experiments. These values are presented in Table 3, along with the lower bound (l.b.) and the approximated lower bound (a.l.b.) of the clustering coefficients according to the above algorithms. Note that l.b. and a.l.b. are computed for all vertices of the graph, while c.c.
is computed only for the sampled vertices; hence it is possible for l.b. and a.l.b. to be higher than the sampled c.c. value. (d) Dominating Sets. We consider four versions of greedy algorithms here: RegularGreedy, VRegularGreedy, FastGreedy-1 and FastGreedy-2. RegularGreedy is the standard greedy algorithm that repeatedly chooses a location covering the maximum number of currently uncovered people. VRegularGreedy works as follows: first find the set P_1 of all people of degree exactly one, then find the set N(P_1) of locations adjacent to P_1. Obviously, any dominating set must contain N(P_1). Next, find the set N(N(P_1)) of people adjacent to N(P_1) (including P_1); obviously N(P_1) dominates N(N(P_1)). Finally, run the RegularGreedy algorithm on the subgraph (P \ N(N(P_1)), L \ N(P_1)). FastGreedy-1 is similar to the FastGreedy algorithm, but we do not take those l_i for which N(l_i) [...]

                 FG1      RG       VRG      TFG1        TRG      TVRG
  Real-Network   --       --       --       --          4 hr.    2.5 hr.
  CL             58.37%   54.47%   54%      < 15 sec.   > 5 hr.  0.5 hr.
  FastGen-1      59.60%   55.78%   55.30%   < 15 sec.   > 7 hr.  0.5 hr.
  FastGen-2      27.90%   27.54%   27.54%   < 15 sec.   > 2 hr.  0.5 hr.

Table 4: Performance of three dominating-set algorithms for the Real-Network and graphs generated by the three models. Percentages denote the sizes of the dominating sets (relative to the whole set of locations) found by the different algorithms; seconds and hours denote the running times. FG1 = FastGreedy-1, RG = RegularGreedy, VRG = VRegularGreedy, TFG1 = time of FastGreedy-1, TRG = time of RegularGreedy, TVRG = time of VRegularGreedy.

Given the model (the FastGen-1 or FastGen-2 model) and the (input) degree sequence, one can figure out (with high confidence) the degree (say d_s) at which FastGreedy-2 should stop. Since the actual degrees of locations are sharply concentrated around their input degrees, one can take as the dominating set those locations with input degree at least d_s.
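The RegularGreedy step common to these variants can be sketched as follows (a minimal illustration with our own names; a lazy-deletion max-heap stands in for whatever bookkeeping the actual implementation uses):

```python
import heapq

def greedy_dominating_set(loc_people, num_people):
    """RegularGreedy sketch: repeatedly choose the location that covers
    the most currently-uncovered people.  Heap entries carry stale
    coverage gains; each popped entry is recomputed and re-inserted if
    out of date (gains only decrease, so a fresh entry at the top is
    guaranteed to be the true maximum)."""
    covered = set()
    chosen = []
    heap = [(-len(ppl), l) for l, ppl in loc_people.items()]
    heapq.heapify(heap)
    while len(covered) < num_people and heap:
        neg_gain, l = heapq.heappop(heap)
        gain = len(set(loc_people[l]) - covered)  # true current gain
        if gain == 0:
            continue
        if -neg_gain > gain:                      # stale entry: refresh
            heapq.heappush(heap, (-gain, l))
            continue
        chosen.append(l)
        covered |= set(loc_people[l])
    return chosen
```

VRegularGreedy's preprocessing would first force in N(P_1), the locations adjacent to degree-one people, and then run this loop on the remaining subgraph.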
Additionally, Table 4 shows that although VRegularGreedy is much faster than RegularGreedy and provides the best solution among all four algorithms, it is still much slower than the FastGreedy algorithms. Also, as we have seen, the FastGreedy algorithms have their own merits in the real world.

Acknowledgments. We thank the SODA 2004 reviewers for their helpful comments.

References

[1] R. Albert and A.-L. Barabási. Statistical mechanics of complex networks. Rev. Mod. Phys., 74 (2002), 47-97.
[2] R. Albert, A. Barabási, and H. Jeong. Diameter of the World Wide Web. Nature, 401 (1999), 130-131.
[3] N. Alon and J. Spencer. The Probabilistic Method, Second Edition. Wiley, 2000.
[4] C. Barrett. Simulation Science: The key to understanding National Infrastructures. Los Alamos Research Quarterly, Spring 2003, available at http://www.ccs.lanl.gov/ccs5/hottop.shtml.
[5] A. Barabási and R. Albert. Emergence of scaling in random networks. Science, 286 (1999), 509-512.
[6] B. Bollobás and O. Riordan. Mathematical results on scale-free graphs. In Handbook of Graphs and Networks, S. Bornholdt and H. Schuster (eds.), Wiley-VCH, Berlin, 2002.
[7] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. Graph structure in the Web. Proc. 9th World Wide Web Conf. (2000), 309-320.
[8] F. Chung-Graham and L. Lu. Connected components in random graphs with given expected degree sequences. Annals of Combinatorics, 6 (2002), 125-145.
[9] C. Cooper and A. Frieze. A general model of Web graphs. Random Structures and Algorithms, 22 (2003), 311-335.
[10] H. Chen, H. Chang, R. Govindan, S. Jamin, S. Shenker, and W. Willinger. The Origin of Power-Laws in Internet Topologies Revisited. Proc. IEEE INFOCOM, 2002.
[11] S. Eubank and J. Smith. Scalable, Efficient Epidemiological Simulation. Proc. ACM Symposium on Applied Computing, Madrid, 2002.
[12] S. Eubank, et al. EpiSims Assessment of Responses to Smallpox Attack. Report to the Office of Homeland Security, June 2002, Los Alamos Technical Report LA-CP-02-254, 2002.
[13] S. Eubank, H. Guclu, V.S. Anil Kumar, A. Srinivasan, Z. Toroczkai, and N. Wang. Controlling Smallpox Epidemics: Strategies Drawn from an Empirical Data Instantiated Virtual City. Technical Report, Los Alamos National Laboratory, 2003.
[14] E. Kaplan, D. Craft, and L. Wein. Emergency response to a smallpox attack: the case for mass vaccination. Proc. Natl. Acad. Sci. USA, 99 (2002), 10935-10940.
[15] C. Barrett et al. TRANSIMS (TRansportation ANalysis SIMulation System): Volume 0-1: Overview, LA-UR-99-1658; Volumes 2-6: LA-UR-99-2574-2580. Los Alamos National Laboratory, 1999.
[16] C. Barrett, R. Jacob, and M. Marathe. Formal Language Constrained Path Problems. SIAM J. Computing, 30(3), 809-837, June 2001.
[17] A. Frieze and C. McDiarmid. Algorithmic theory of random graphs. Random Structures and Algorithms, 10 (1997), 5-42.
[18] M. Faloutsos, P. Faloutsos, and C. Faloutsos. On the power law relationships of the Internet topology. Comput. Comm. Rev., 29 (1999), p. 251.
[19] U. Feige. A Threshold of ln n for Approximating Set Cover. JACM, 45(4), 634-652, July 1998.
[20] R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins, and E. Upfal. The web as a graph. Proc. 19th ACM Symp. on Principles of Database Systems (2000), 1-10.
[21] R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins, and E. Upfal. Stochastic models for the web graph. Proc. 41st IEEE Symp. on Foundations of Computer Science (2000), 57-65.
[22] M. Molloy and B. Reed. A critical point for random graphs with a given degree sequence. Random Structures and Algorithms, 6 (1995), 161-179.
[23] S. V. Pemmaraju. Equitable colorings extend Chernoff-Hoeffding bounds. SODA (2001), 924-925.
[24] A. Srinivasan. Distributions on Level-Sets with Applications to Approximation Algorithms. Proc. IEEE Symposium on Foundations of Computer Science (FOCS 2001), 588-597.
[25] D. Watts and S. Strogatz. Collective dynamics of 'small-world' networks. Nature, 393 (1998), 440-442.