The Discoverability of the Web

Anirban Dasgupta, Arpita Ghosh, Ravi Kumar, Christopher Olston, Sandeep Pandey, Andrew Tomkins
Yahoo! Research, 701 First Ave, Sunnyvale, CA 94089
{anirban, arpita, ravikumar, olston, spandey, atomkins}@yahoo-inc.com

ABSTRACT
Previous studies have highlighted the high arrival rate of new content on the web. We study the extent to which this new content can be efficiently discovered by a crawler. Our study has two parts. First, we study the inherent difficulty of the discovery problem using a maximum cover formulation, under an assumption of perfect estimates of likely sources of links to new content. Second, we relax this assumption and study a more realistic setting in which algorithms must use historical statistics to estimate which pages are most likely to yield links to new content. We recommend a simple algorithm that performs comparably to all approaches we consider.

We measure the overhead of discovering new content, defined as the average number of fetches required to discover one new page. We show first that with perfect foreknowledge of where to explore for links to new content, it is possible to discover 90% of all new content with under 3% overhead, and 100% of new content with 9% overhead. But actual algorithms, which do not have access to perfect foreknowledge, face a more difficult task: one quarter of new content is simply not amenable to efficient discovery. Of the remaining three quarters, 80% of new content during a given week may be discovered with 160% overhead if content is recrawled fully on a monthly basis.

Categories and Subject Descriptors
H.3.m [Information Storage and Retrieval]: Miscellaneous

General Terms
Algorithms, Experimentation, Measurements

Keywords
Crawling, discovery, set cover, max cover, greedy

1. INTRODUCTION
In this paper we are concerned with crawling the web in order to discover newly-arrived content. Figure 1 illustrates the key challenges of our problem. First, page p5 may be discovered by crawling either page p1 or page p3, introducing a combinatorial cover problem that is NP-hard to solve exactly. Second, pages p6 and p7 may be discovered only by crawling new page p4. We will study policies for recrawling known pages in order to minimize the overhead of discovering new content.

Figure 1: Old pages linking to new pages.

Our study has three goals: to characterize the arrival of new content; to provide algorithms for discovery that exploit this characterization; and to measure the overhead of discovering this content for various levels of freshness and coverage.

1.1 Motivation
Search engines today have strong freshness requirements at multiple timescales. Within minutes of breaking events, users expect to visit a search engine to gain access to news, blog posts, and other forms of content related to the event. Freshness requirements for such information range from minutes to hours. For less immediate content such as reviews of a new product, users are disappointed if a review exists on the web but not in the search engine index; freshness requirements here range from hours to days. And finally, obscure content that meets an information need should make its way to the index within days to weeks.

Despite these requirements, there are serious limitations on the ability of the crawler to procure new content in a timely manner. First, bandwidth remains limited, so downloading the entire web every day is not practical. More importantly, the number of requests per second to an individual website is also limited by politeness rules. Many sites are so large that they cannot be crawled from start to finish within a week under standard politeness assumptions. And many sites report crawler traffic as a significant fraction of total traffic, including multiple downloads of unchanged pages. Thus, careful management of the limited accesses available to a crawler is now mandatory.
Additionally, all crawlers must trade off recrawl of existing pages against first crawls of unseen pages; an understanding of the discoverability of new content allows an understanding of the diminishing returns of increasingly aggressive discovery policies.

Finally, an analysis of the discoverability of the web exposes an evolutionary property of the graph that is not well understood, namely, the mechanism by which new pages are "linked in" to the graph by modifications to old pages. Our lack of understanding of these matters raises concerns about the effectiveness of graph generators, and even about the effectiveness of the crawling model as an approach to timely discovery of new content going forward.

1.2 Problem discussion
There are two primary mechanisms by which new pages arrive on the web. First, a website puts up a new page and links to this new page from an existing page. Second, an entirely new website appears and is linked to from an existing website. A third possible mechanism is that one website puts up a new page without linking to it, and another website provides a link to the new page; this situation is very uncommon in our data, and we do not study it.

The relative fractions of pages appearing as a result of these two mechanisms depend on the elapsed time between observations. As this window shrinks, we will discover new sites at an earlier stage of their growth, and hence an increasing fraction of pages will appear as new pages on old sites. Even when the window is a full month, however, we show that 85-95% of new pages appear on existing sites, suggesting that the problem of analyzing known sites is of paramount importance. We therefore study this problem in greater detail.

Before proceeding, we must observe that no web crawler may actually crawl the entire reachable web.
Due to infinite websites, spider traps, spam, and other exigencies of the real web, crawlers instead apply a crawl policy to determine when the crawling of a site should be deemed sufficient. Some sites are crawled exhaustively, while others are crawled only partially. In this paper, we focus only on sites that are to be crawled exhaustively, as the remaining sites have been deemed lower priority in terms of absolute coverage.

Suppose the crawler has performed an initial complete crawl of some site at time t. Now imagine that at a later time t + Δ the crawler must revisit the site and find all the new pages. If it is the case that a small set of old pages collectively links to all new pages, then the crawler can in principle discover new pages with minimum overhead. For example, in Figure 1, recrawling just page p1 leads to discovery of all new pages. How well this idea can work on the real web is the subject of this paper.

The fundamental questions are as follows (this paper tackles several of these, as indicated by the forward pointers below; the remainder are left as future work):

Basic feasibility of the approach:
· Is it the case for real websites that most new pages can be discovered via a small set of old pages? (Section 4)

Key characteristics that determine what crawling approaches are likely to work well:
· To what extent are links to new content redundant (as in p1 → p5 and p3 → p5 in Figure 1)? (Section 4)
· Does the set of old pages that link to many new pages tend to remain consistent over time?

Efficient crawl policies for content discovery:
· What is a good choice of old pages to seed the discovery process, given historical information and a crawl budget? (Section 5)
· What fraction of the budget should be spent assessing the usefulness of various old pages, versus exploiting ones already known to be somewhat useful?

Our key findings are as follows. We show first that with perfect foreknowledge of where to explore for links to new content, it is possible to discover 90% of all new content with under 3% overhead, and 100% of new content with 9% overhead. But actual algorithms, which do not have access to perfect foreknowledge, face a more difficult task: one quarter of new content is simply not amenable to efficient discovery. Of the remaining three quarters, 80% of new content during a given week may be discovered with 160% overhead if content is recrawled fully on a monthly basis.

1.3 Related work
Numerous early web studies focused on properties of a snapshot of the web graph [2, 4, 12, 17, 18]. More recently, attention has turned to evolutionary properties of the corpus. In this evolutionary model, researchers have considered the growth of the web [3], the rates of page and link churn [8, 14, 19], the rates of duplicate evolution [13], and the change rates of individual pages [3, 5, 14, 22].

Parallel to this line of work, there has been a significant body of work on refreshing already-discovered content [6, 9, 10, 21, 25]. Already-discovered pages are recrawled to keep the search engine's local repository fresh, so that queries are not answered incorrectly due to stale information, while the discovery of new pages is important for ensuring that as many relevant query results as possible are shown. It is tempting to view our problem as equivalent, with new outlinks taking the role of new content on existing pages, but there is a critical distinction: in our problem, many pages can be recrawled, each of which points to a new page, but the value depends on the union rather than the sum. If the pages all point to the same new content, there is very little value from a discoverability standpoint, but great value from the standpoint of the freshness of the recrawled pages. To our knowledge, this specific problem has not been studied previously.

Finally, there has been work on ordering the frontier of a crawl [7, 11], in which various policies are studied from the perspective of estimating the quality of a candidate for first-time crawl. This work is orthogonal to ours; once new pages have been discovered, it remains to prioritize them for crawling.

2. PRELIMINARIES

2.1 Formulation
A snapshot Gt = (Vt, Et) of a given site at time t is a directed graph, where Vt is the set of nodes (pages) and Et is the set of directed edges (hyperlinks).
Define $X_t = \bigcup_{j=1}^{t-1} V_j$ to be the set of old nodes at time t, and define $Y_t = V_t \setminus X_t$ to be the set of new nodes at time t. The old nodes Xt are nodes that appeared before time t, and the new nodes Yt are nodes that appear for the first time at time t.

For convenience, we use the following representation for the old and new nodes at any time t. Let Ht = (Xt, Yt, Zt) be a bipartite graph consisting of the old nodes Xt, the new nodes Yt, and an edge set Zt. This graph will reflect information relevant to our discovery problem, but will not reflect all information in the original graph. An edge z = (x, y) exists whenever y ∈ Yt is efficiently discoverable from x ∈ Xt, i.e., there is a path from x to y of the form x → y1 → y2 → ··· → yk = y where each yi ∈ Yt is a new node. In this case we say that each yi is covered by x. Figure 1 shows the motivation for this definition: by crawling a node that reveals the start of a long chain of new nodes, we may proceed to download the entire chain of new content recursively with no additional discovery overhead (as each node of the chain is new, and hence must be crawled anyway).

The problem of discovering new content is then the following: cover the new nodes in Yt using as few nodes from Xt as possible. Combinatorially speaking, there are two natural (unweighted) versions of this problem. The first is the k-budgeted cover problem, where we are given a budget k and want to cover as many nodes in Yt as possible using k nodes from Xt. The second is the α-partial cover problem, where we are given α ≤ 1 and the goal is to cover at least α|Yt| nodes in Yt using as few nodes from Xt as possible. Both problems are NP-hard [24].

We study different algorithms for these problems based on the amount of information that is available at the time when a node must be crawled. First, in Section 2.2 we describe an algorithm called Greedy, which has complete information about Ht; this algorithm should be viewed as an upper bound on the performance of any realistic algorithm.¹ Next, in Section 5 we describe a family of algorithms that use information that is realistically available at the time when a node must be crawled. In particular, they do not have access to Ht, but depending on the model, they may have partial information about Xt and statistical information about Ht based on partial information about Ht' for t' < t.

¹ This algorithm is not strictly speaking an upper bound, as it makes approximations in order to solve an NP-hard problem; however, the information available to the algorithm allows it to perform substantially better than any realistic algorithm we have seen.

Note that we have not addressed the issue of nodes disappearing or dying between crawls. Our algorithms may be adapted in a straightforward manner to this case, but we focus in this paper on the basic case in which nodes do not disappear.

Notation. For each x ∈ Xt, we denote by N(x) the set of new nodes efficiently discoverable from x, i.e., N(x) = {y | (x, y) ∈ Zt}. For a subset S of Xt, we define $N(S) = \bigcup_{x \in S} N(x)$. The Jaccard coefficient between two nodes x and y is
$$J_{xy} = \frac{|N(x) \cap N(y)|}{|N(x) \cup N(y)|}.$$
A Jaccard coefficient close to 1 means that x and y point to a very similar set of nodes, and a value close to 0 means that they are almost non-overlapping.

Key metric: overhead. In general, if a set O of old pages is crawled to discover |N(O)| new pages, then we define the overhead of O as |O|/|N(O)|. Overhead numbers should be read as follows: if 100 new pages may be captured by a cover of size five, then an algorithm must perform five "wasted" fetches, in the sense that they do not return new pages, in order to generate enough information to fetch the 100 new pages. The overhead is 5%, and is a direct measure of the fraction of additional fetches necessary to gather a given number of new pages; in other words, it is a measure of efficiency.
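To make the formulation concrete, the following Python sketch builds, for each old page, the set N(x) of efficiently discoverable new pages, and computes the overhead of a candidate set of old pages. This is our own illustration rather than the authors' code, and the toy link structure is only a plausible reading of Figure 1; efficient discoverability is implemented as a traversal from an old page that is allowed to continue only through new pages.

```python
from collections import deque

def build_discovery_graph(old_nodes, new_nodes, edges):
    """Return N: old node -> set of new nodes efficiently discoverable from it.

    A new node y is efficiently discoverable from an old node x if there is a
    path x -> y1 -> ... -> yk = y whose nodes after x are all new.
    `edges` maps each node to the list of nodes it links to.
    """
    N = {}
    for x in old_nodes:
        covered = set()
        frontier = deque(y for y in edges.get(x, []) if y in new_nodes)
        while frontier:
            y = frontier.popleft()
            if y in covered:
                continue
            covered.add(y)
            # continue the walk, but only through new nodes
            frontier.extend(z for z in edges.get(y, [])
                            if z in new_nodes and z not in covered)
        N[x] = covered
    return N

def overhead(old_pages, N):
    """Overhead of recrawling `old_pages`: old-page fetches per new page found."""
    discovered = set().union(*(N[x] for x in old_pages)) if old_pages else set()
    return len(old_pages) / len(discovered) if discovered else float('inf')

# Tiny hypothetical example in the spirit of Figure 1: p1 and p3 both reach p5,
# while p6 and p7 are reachable only through the new page p4.
old_nodes = {'p1', 'p2', 'p3'}
new_nodes = {'p4', 'p5', 'p6', 'p7'}
edges = {'p1': ['p5', 'p4'], 'p3': ['p5'], 'p4': ['p6', 'p7']}
N = build_discovery_graph(old_nodes, new_nodes, edges)
print(N['p1'])              # {'p4', 'p5', 'p6', 'p7'}
print(overhead({'p1'}, N))  # 0.25, i.e., 25% overhead in this tiny example
```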
2.2 An algorithmic upper bound: Greedy
While the maximization problem of k-budgeted cover admits a (1 − 1/e)-approximation algorithm, the minimization problem of α-partial cover can only be approximated to within a log |Xt| factor [16, 23]. Coincidentally, the same greedy algorithm can be used for both problems. For completeness, we present the greedy algorithm below. In words, the algorithm proceeds by repeatedly selecting the old node that covers the most uncovered new nodes.

Algorithm Greedy(Xt, Yt, Zt)
    Set Ct = ∅.
    While "not done" do:
        Find x ∈ Xt \ Ct that maximizes |N(x) \ N(Ct)|; break ties arbitrarily.
        Set Ct = Ct ∪ {x}.
    Return Ct.

For the k-budgeted cover problem, the predicate "not done" is true as long as |Ct| < k. For the α-partial cover problem, this predicate is true as long as |N(Ct)| < α|Yt|.
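A minimal executable rendering of Algorithm Greedy, supporting both stopping rules, might look as follows. This is our own sketch, not the authors' implementation; it assumes the mapping N from old pages to their efficiently discoverable new pages, as built in the sketch of Section 2.1, and with neither stopping rule set it simply computes a full cover.

```python
def greedy_cover(N, budget_k=None, alpha=None, total_new=None):
    """Greedy cover over the bipartite graph given as N: old node -> set of new nodes.

    If budget_k is set, stop after k old nodes (k-budgeted cover).
    If alpha is set, stop once alpha * total_new new nodes are covered (alpha-partial cover).
    """
    if alpha is not None and total_new is None:
        total_new = len(set().union(*N.values())) if N else 0
    cover, covered = [], set()
    while True:
        if budget_k is not None and len(cover) >= budget_k:
            break
        if alpha is not None and len(covered) >= alpha * total_new:
            break
        # pick the old node that covers the most still-uncovered new nodes
        best = max(N, key=lambda x: len(N[x] - covered), default=None)
        if best is None or not (N[best] - covered):
            break  # nothing left to gain
        cover.append(best)
        covered |= N[best]
    return cover, covered

# Example with the toy graph from the Section 2.1 sketch:
# cover, covered = greedy_cover(N, alpha=0.9)
```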
3. DATA
We consider two datasets, to address two distinct problems within our scope. First, we consider a sequence of complete crawls of a number of websites. This dataset allows us to study in detail the process by which new pages on a site are incorporated into the existing graph of the site. Second, we consider a sequence of complete crawls of the Chilean web. This dataset, by contrast, allows us to study inter-site linking, and particularly the problem of discovering entirely new websites. We describe these two datasets below.

Site recrawl dataset. We consider a repeated crawl of 200 web sites over a period of many weeks. This dataset was used in earlier work by Ntoulas, Cho, and Olston; see [20] for more details about the crawl and the principles used to select the web sites. The authors of that work have continued to collect data, and have generously allowed us to employ more recent snapshots than those in their reported results. Of the 200 web sites they crawl, we removed those sites that contained fewer than 100 pages in any snapshot (i.e., the site did not have significant size) or more than 200,000 pages (which was a crawler-imposed upper bound on the number of pages per site, introducing skew into the analysis of new pages). This resulted in 77 sites. Of these sites, we selected 42 that were well represented at each snapshot and that did not show any gross anomalies.

The 42 websites in the resulting dataset were crawled repeatedly over a period of 23 weeks from 11/14/2004 to 6/12/2005 (the crawler did not execute during every week). The total number of pages at the first timestep was 640,489, and 223,435 new pages appeared over this period, of which about 40% are directly linked from some old page.

For each of the web sites and for each snapshot, we first parsed the crawl output and extracted the outlinks and redirect information. We omitted all off-site links and focused only on on-site links. We also discarded orphans, i.e., pages in Yt that are not covered by any page in Xt. Orphans accounted for less than 5% of the new pages in our dataset. We then constructed the bipartite graph Ht defined above for the purposes of analysis; recall that this step involves examining paths from old pages to new pages.

Chilean web dataset. We employ a new dataset to study this problem, based on the Chilean web. We have three snapshots of the Chilean web, based on complete crawls performed monthly for three months; the first snapshot had 7.40M pages and 67.50M edges, and the third snapshot had 7.43M pages and 70.66M edges.

4. MEASUREMENTS
In this section we present a series of measurements on both of our datasets. In addition to basic properties of the data, we study in detail the extent to which algorithm Greedy is able to efficiently cover the new content. We begin with a series of experiments on the site recrawl dataset, studying the discovery of new pages on existing sites. Based on the results of this analysis, we then turn in Section 4.4 to an analysis of the Chilean dataset, in which we study the relative prominence of new pages on existing sites versus new pages on new sites.

4.1 Cover size
For each site at each time, we construct the bipartite graph H and employ Greedy to cover all new pages. Figure 2 plots a point for each site, each timestep, and each partial cover (i.e., for a cover of size 10, we show a point for the first node of the cover, the first two nodes, and so forth; each successive partial cover captures more nodes with higher overhead than the previous partial cover). A point at location (x, y) represents a cover of size x that covers y new pages. The figure represents approximately 400 trajectories, one for each site at each timestep, but many of these are overlapping; the lower graph of the figure shows a breakout of the smaller trajectories at a larger scale.² The graph clearly shows the diminishing returns as each cover grows. Further, an examination of the knee of the curves shows that most covers efficiently capture 90% of the total new pages, but must work much harder to cover the remaining 10%.

² The outlier trajectory in the top graph of Figure 2 is from the site oreilly.com. It came about when a content management change caused over 2,000 catalog entries to introduce a link to a new variant of the same page; the new destination was discoverable from no other page on the site. Thus, the limit of the anomalous trajectory is a line of slope 1, in which each recrawl event yields a single page of new content.

Figure 2: Cover size versus the number of pages covered.

We present a detailed aggregation of these numbers in Figure 3(a-b). We perform an experiment in which we employ Greedy for a particular site at a particular time, but terminate processing when either all new pages have been covered or the current cover has reached a certain size k; this corresponds to the k-budgeted cover problem. In Figure 3(a-b), the x-axis represents the threshold k, the maximum size cover we will employ for any site/time pair.
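As an illustration of how this truncation experiment can be carried out, the sketch below (our own code; the exact aggregation the authors used, e.g., averaging per site/time pair versus pooling, is an assumption) truncates each site/time pair's greedy cover at a threshold k and reports the average number of pages covered, the overhead, and the overall fraction of new content covered.

```python
def truncation_stats(site_covers, k):
    """site_covers: list of (ordered_cover, N) pairs, one per site/time pair, where
    ordered_cover is that pair's greedy order of old pages and N maps old page -> new pages.
    Returns (average pages covered, overhead, fraction of all new content covered) at threshold k."""
    pages_covered, fetches, total_new = 0, 0, 0
    for ordered_cover, N in site_covers:
        all_new = set().union(*N.values()) if N else set()
        total_new += len(all_new)
        truncated = ordered_cover[:k]
        covered = set().union(*(N[x] for x in truncated)) if truncated else set()
        pages_covered += len(covered)
        fetches += len(truncated)
    n = len(site_covers)
    avg_covered = pages_covered / n if n else 0.0
    ovh = fetches / pages_covered if pages_covered else float('inf')
    frac = pages_covered / total_new if total_new else 0.0
    return avg_covered, ovh, frac
```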
Figure 3(a) shows two curves. The higher curve is measured on the left axis; it shows for each value of k the average number of new pages captured by the cover. However, notice that for a fixed value of k, each site/time pair might have a cover of size k or smaller, depending on whether a smaller cover was adequate to capture all the new pages. We therefore also include the lower curve, which is measured on the right axis; it shows for each value of k the overhead of the cover.

As k grows large, the number of pages covered tops out at about 300 on average, which is a reflection of our dataset. However, the overhead never exceeds 9%, indicating that although the rightmost region of the curve returns 300 new pages per cover with k = 600, the "average" cover size is in fact only 9% of 300, or about 27. We mention in passing that, while the x-axis of the figure has been truncated at 600 to focus on the region of interest, the remainder of both curves is stable at 300 and 9%, respectively.

Figure 3(a) is therefore a measure of how efficiently covers truncated at a certain size can return new content, but so far we have said nothing about what fraction of the total new content has been returned. Figure 3(b) addresses this question. Once again, the x-axis represents the threshold k on the cover size, and the y-axis now shows the overall fraction of new pages that would be covered if all covers were truncated at size k. Setting k = 200, we cover 97.3% of all new content. We cover 90% of new content once k reaches 83.

Figure 3: (a) Overhead and number of covered pages, (b) fraction of new pages covered, (c) 90% cover statistics.

4.1.1 90% covers
Based on the above observations, it appears possible to cover 90% of new content with relatively low overhead. We therefore adopt this somewhat arbitrary threshold, and study the nature of covers that capture at least 90% of the new pages for a given site/time pair. Figure 3(c) is a scatter plot showing a detailed breakout of this information. A point at (x, y) means that a particular site at a particular time had a cover of size x that covered y new pages. As algorithm Greedy adds a page only when it results in the discovery of at least one page of new content, there are no points below the line y = x. The figure is promising to the extent that there is a significant mass of points far from the line. Note that the figure is shown in log-log scale, and there are clearly numerous points in which a small cover produces a large number of new pages.

We may ask about the distribution of sizes of these 90% covers. Figure 4 shows this distribution as a histogram, giving the number of site/time pairs for which the 90% cover has a certain absolute size. Small covers of five or fewer pages suffice to capture 90% of the new content of most sites, but for a nontrivial number of sites, covers of more than a hundred pages are required. No site in our sample ever required a 90% cover of more than one thousand pages.

Figure 4: Histogram of 90% cover sizes.

4.2 Node redundancy
If no two old pages link to the same new page, then the cover problems addressed by Algorithm Greedy become trivial; the problem is interesting only when there is overlap in the sets of new pages covered by old pages. In our data, most pairs of pages (within a site) fall into one of two categories: either they link to almost the same set of new pages, or they have almost no new pages in common. Figure 5 shows that a significant fraction of pairs have Jaccard coefficient very close to 0 or very close to 1. This has important algorithmic implications, as we will see later in Section 5.2.

Figure 5: Overlap distribution.
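The overlap measurement behind Figure 5 can be reproduced with a few lines of code. The sketch below is ours, not the authors'; it computes the Jaccard coefficient of the covered sets for every pair of old pages on a site, ignoring pages that cover nothing.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard coefficient of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def overlap_distribution(N, bins=10):
    """Histogram of pairwise Jaccard coefficients between old pages' covered sets."""
    histogram = [0] * bins
    pages = [x for x in N if N[x]]  # ignore old pages that cover no new pages
    for x, y in combinations(pages, 2):
        j = jaccard(N[x], N[y])
        histogram[min(int(j * bins), bins - 1)] += 1
    return histogram
```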
4.3 Overhead of discovering new pages
Figure 6 shows the overhead for various cover sizes. As the figure shows, and as stated above, we attain 90% covers with 3% overhead, and 100% covers with 9% overhead. Recall, however, that these numbers are the results of a thought experiment in which a crawler happens to pick a near-perfect set of pages to crawl in order to find new content; they represent a goal we would like to attain. The reader should be heartened that the numbers look so promising, but should await Section 5 to determine whether these numbers can be matched by a real algorithm that must search for new content in a more hit-or-miss fashion.

Figure 6: Global discovery of new pages on old sites.

4.4 Overhead of discovering new sites
In this section we study the relative importance of discovering new pages on old sites, versus new pages on new sites. We have presented statistics showing the performance of Algorithm Greedy on each individual site, aggregated in various ways. We now ask how Greedy might perform across all sites at once, by operating on the union of the bipartite graphs Ht corresponding to each individual site. When asked for a 90% cover, such an algorithm may cover 90% of each site, or may cover many sites completely while covering others only superficially, based on the relative gains of each crawl event. We observe in passing that such an algorithm is simple to implement once Algorithm Greedy has already run on each site: a greedy cover of disjoint sets may be constructed simply by interleaving the greedy covers of each set, in a greedy manner. That is, each site may be viewed as a set of candidate pages, ordered according to the greedy cover for that site. The algorithm repeatedly chooses the top remaining page from some site, selecting the one that covers the largest number of uncovered resources (a code sketch of this interleaving appears at the end of this subsection).

We perform the following experiment on the three snapshots of the Chilean web. We begin by observing that a site that appears for the first time during the second or third month contains on average 18.1 pages. Thus, the effort of discovering such a site may be amortized across the 18+ pages that will be returned by crawling the site. Table 1 considers each page that occurred for the first time in the second or third month of the crawl, and checks whether the domain of the page occurred earlier. As the results show, 16% of new pages in the second snapshot, and 7% of pages in the third snapshot, occur on sites that did not appear during the previous snapshot. This suggests that the vast majority of new content appears on existing sites, rather than new sites.

Snapshot | new pages on new sites | new pages on old sites | Pr[new site | new page]
1 → 2    | 452,461                | 2,404,045              | 16%
2 → 3    | 173,537                | 2,272,799              | 7%

Table 1: Fraction of new pages appearing on new sites versus old sites in the Chilean web dataset.

Figure 7 shows the number of existing pages that must be crawled in order to discover new web sites. Comparing Figures 6 and 7, we see that many more pages must be crawled to discover new sites than to discover new pages on existing sites. (The main reason the latter problem is easier is the propensity of webmasters to include useful pages guiding us to new information, e.g., the "What's new" page.) In the new site discovery problem, depending on the fraction of new sites that must be covered, each page fetch will yield between 1.5 and 3 new sites. However, as we observed above, each of these sites will return on average 18.1 pages, resulting in an overall overhead of just 3.7%, even for 100% coverage. These results, combined with the observation from Table 1 that most new pages occur on old sites, convince us to focus the remainder of our exploration on existing sites.

Figure 7: Chile site-level discovery.
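The interleaving of per-site greedy covers described above is straightforward to implement once each site's greedy order and marginal gains are known. The following sketch is our own; because sites are disjoint and within-site marginal gains are non-increasing, a heap over the current head of each site's list reproduces the globally greedy order.

```python
import heapq

def interleave_site_covers(site_orders):
    """site_orders: dict site -> list of (page, marginal_gain) in that site's greedy order.
    Yields (site, page, gain) in globally greedy order. Sites are disjoint and each site's
    marginal gains are non-increasing, so a heap over the head of each site's list suffices."""
    heap = []
    for site, order in site_orders.items():
        if order:
            page, gain = order[0]
            heapq.heappush(heap, (-gain, site, 0, page))
    while heap:
        neg_gain, site, idx, page = heapq.heappop(heap)
        yield site, page, -neg_gain
        nxt = idx + 1
        if nxt < len(site_orders[site]):
            next_page, next_gain = site_orders[site][nxt]
            heapq.heappush(heap, (-next_gain, site, nxt, next_page))

# Truncating the merged stream after k pages yields a global k-budgeted cover.
```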
5. HISTORY-BASED ALGORITHMS
In the previous section we studied the feasibility of using a small set of existing nodes to cover most of the newly generated content, i.e., we measured whether there exists a small set of old nodes with links to most of the new content. In this section we move to the algorithmic question of choosing such a set of nodes when we do not have access to the entire bipartite graph Ht. We assume that we have access to the old nodes Xt but not to Zt, the set of edges, or to Yt, the set of new nodes. (In reality, we may only have access to a subset of Xt, since some nodes in Xt may not have been discovered by time t due to incomplete crawling before t. We ignore this for now.)

In this section we explore algorithms that use historical information, i.e., statistics from the graphs H_{t−i}, in order to discover new content in Ht. There are two separate questions: first, how to aggregate information from the various H_{t−i} to estimate relevant statistics; and second, more open-ended, which statistics lead to good covers. To address this, we describe and evaluate three algorithms that employ different statistics gathered from past observations to solve the k-budgeted cover problem. The first algorithm, Od, crawls pages according to the number of new pages discovered historically when crawling the page. The second algorithm, Cliq, employs past degree information as well, and in addition uses information about overlaps in the sets of pages discovered by each pair of pages. Rather than computing and storing all pairwise information between existing pages, Cliq groups existing pages into clusters that have produced the same set of pages in the past, according to the gap observation of Section 4.2, and employs this information in order to choose a cover. The third algorithm, Cov, uses historical results of the algorithm Greedy, i.e., it chooses to track pages that were previously in the cover constructed from full recrawls of the data.

In what follows, we define S* to be the optimal solution to the k-budgeted cover problem on Ht (Section 2), and let S̄ be the solution returned by an algorithm Alg. We define ρ(Alg) as the ratio of the number of new nodes covered by S̄ to that covered by S*, i.e., ρ(Alg) = |N(S̄)| / |N(S*)|. We use N to denote the total number of new nodes.

5.1 Algorithm based on outdegree
We consider a very basic algorithm first. Suppose that for every old node i we have an estimate of p_i = |N(i)|/N, the fraction of new nodes covered by i. A natural algorithm is the following: pick the k old nodes with the largest p_i's and crawl these nodes. We refer to this algorithm as Od. Below, we state a bound on its performance, assuming the p_i's are correct estimates. Subsequently, we will define variants of this algorithm that are amenable to experimentation, based on different approaches to estimating the p_i values.

Lemma 1. Let p_[j] denote the j-th largest of the p_i's. Then,
$$\rho(\mathrm{Od}) \;\ge\; \frac{p_{[1]}}{p_{[1]} + \sum_{i=k+1}^{2k-1} p_{[i]}}.$$

Proof. Suppose there are N_1 new nodes obtained from the nodes with degrees p_[2], ..., p_[k] that are distinct from the new nodes obtained from p_[1]. The number of new nodes found by the greedy algorithm is N p_[1] + N_1. The number of new nodes found by the optimum cannot be greater than $N p_{[1]} + N_1 + N \sum_{i=k+1}^{2k-1} p_{[i]}$ (recall that the p_[i] are decreasing). So
$$\rho(\mathrm{Od}) \;\ge\; \frac{N p_{[1]} + N_1}{N p_{[1]} + N_1 + N \sum_{i=k+1}^{2k-1} p_{[i]}} \;\ge\; \frac{p_{[1]}}{p_{[1]} + \sum_{i=k+1}^{2k-1} p_{[i]}}.$$

The above bound is tight. If the degree distribution of nodes in Xt is a power law, the bound shows that this naive algorithm will perform very well. However, the presence of mirrors can cause this fraction to be as small as 1/k. This, together with the observations in Section 4.2, leads to the next algorithm.
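A sketch of Od with one obvious historical estimator of p_i (our own choice of estimator; the paper goes on to evaluate several ways of estimating these values) is shown below.

```python
def od_select(history, k):
    """history: list of past observations, each a dict mapping old page -> number of new
    pages it yielded when recrawled. Returns the k pages with the largest average yield."""
    totals, counts = {}, {}
    for snapshot in history:
        for page, new_pages_found in snapshot.items():
            totals[page] = totals.get(page, 0) + new_pages_found
            counts[page] = counts.get(page, 0) + 1
    estimate = {page: totals[page] / counts[page] for page in totals}
    return sorted(estimate, key=estimate.get, reverse=True)[:k]
```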
5.2 Algorithms based on overlap
Here we describe an algorithm for choosing a small cover that exploits estimated overlap information. Let p_i be as above, and for a pair of old nodes i, j, let p_ij be the fraction of new nodes that i and j both cover: p_ij = |N(i) ∩ N(j)|/N. Figure 5 empirically demonstrated that most nodes overlap in either a very large or a very small set of links. We state a lemma showing that under an idealized form of this observation, it is possible to uniquely partition nodes into groups that all link to almost the same set of new nodes.

Lemma 2. Let b, s ≤ 1/3. If for all nodes i, j we have either J_ij ≥ 1 − b or J_ij ≤ s, then the set of old nodes Xt can be partitioned into equivalence classes, where every pair of old nodes i, j in an equivalence class has Jaccard coefficient J_ij ≥ (1 − b).

Proof. We will show that for such b and s, if J_ij ≥ 1 − b and J_jk ≥ 1 − b, then J_ik cannot be less than s. From the assumptions, |N(i) \ N(j)| ≤ b|N(i) ∪ N(j)|, and similarly |N(k) \ N(j)| ≤ b|N(k) ∪ N(j)|. So the number of elements not in common between i and k is at most b(|N(i) ∪ N(j)| + |N(j) ∪ N(k)|), i.e.,
$$J_{ik} = \frac{|N(i) \cap N(k)|}{|N(i) \cup N(k)|} \ge \frac{|N(i) \cup N(k)| - b\bigl(|N(i) \cup N(j)| + |N(j) \cup N(k)|\bigr)}{|N(i) \cup N(k)|} = 1 - b\,\frac{|N(i) \cup N(j)| + |N(j) \cup N(k)|}{|N(i) \cup N(k)|} \ge 1 - b\left(\frac{|N(i) \cup N(j)|}{|N(i)|} + \frac{|N(k) \cup N(j)|}{|N(k)|}\right) \ge 1 - b\left(\frac{1}{1-b} + \frac{1}{1-b}\right),$$
which is strictly greater than s for b, s ≤ 1/3. The last inequality follows from |N(i)| ≥ |N(i) ∩ N(j)| ≥ (1 − b)|N(i) ∪ N(j)|, and similarly for k. In summary, we showed that J_ij ≥ (1 − b) and J_jk ≥ (1 − b) imply J_ik > s for b, s ≤ 1/3. Recall that J_·,· is a metric. By our assumption, J_ik is either at least (1 − b) or at most s, so we have shown that J_ik ≥ (1 − b), i.e., the old nodes can be partitioned into equivalence classes.

We analyze the performance of the following algorithm, Cliq. Let C_1, ..., C_ℓ be the equivalence classes as above, and let k′ = min(k, ℓ). Let q_i = max_{j ∈ C_i} p_j be the degree of the highest-degree node in the i-th equivalence class, and let n_i be the node with this degree. We first sort C_1, ..., C_ℓ in order of descending q_i's. The output S̄ of the algorithm is the set of n_i's corresponding to the k′ largest q_i's.

Theorem 1. $\rho(\mathrm{Cliq}) \ge \dfrac{1 - k's}{1 + k'b}$.

Proof. First we lower bound the number of new nodes obtained by Cliq. Denote by T_j the number of new nodes obtained by adding n_j to S̄. From n_1 we get T_1 = N q_1 new nodes. Define q_ij = p_{n_i n_j} = |N(n_i) ∩ N(n_j)|/N. From the j-th node added by Cliq, the number of new nodes obtained is $T_j \ge N q_j - \sum_{i=1}^{j-1} N q_{ij}$. Since n_i and n_j belong to different classes, J_{n_i n_j} ≤ s, so
$$q_{ij} = \frac{|N(n_i) \cap N(n_j)|}{N} \le \frac{s\,(|N(n_i)| + |N(n_j)|)}{N} = s\,(q_i + q_j).$$
Substituting above, $T_j \ge N q_j - N s \sum_{i=1}^{j-1} (q_i + q_j)$. Summing over all j,
$$\sum_{i=1}^{k'} T_i \;\ge\; \sum_{i=1}^{k'} \Bigl( N q_i - N s \sum_{j=1}^{i-1} (q_i + q_j) \Bigr).$$
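A simplified sketch of Cliq's selection rule is shown below. This is our own rendering under stated assumptions: pages are clustered by a single-link rule against a class representative using an arbitrary Jaccard threshold `near`, whereas the analysis above assumes the idealized class structure of Lemma 2 and works from estimated overlaps rather than the true covered sets.

```python
def cliq_select(N, k, near=0.7):
    """Simplified Cliq-style selection (our own code). Old pages whose covered sets have
    Jaccard coefficient >= `near` are placed in the same class; from each class we keep the
    page with the largest cover, and return the representatives of the k largest classes."""
    def jaccard(a, b):
        union = a | b
        return len(a & b) / len(union) if union else 0.0

    classes = []  # each class is a list of pages, largest cover first
    for page in sorted(N, key=lambda p: len(N[p]), reverse=True):
        for cls in classes:
            if jaccard(N[page], N[cls[0]]) >= near:  # compare to the class representative
                cls.append(page)
                break
        else:
            classes.append([page])
    representatives = [cls[0] for cls in classes]  # highest-degree page in each class
    representatives.sort(key=lambda p: len(N[p]), reverse=True)
    return representatives[:k]
```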