Finding Planted Partitions in Nearly Linear Time using Arrested Spectral Clustering

Nader H. Bshouty, Department of Computer Science, Technion, 32000 Haifa, Israel (bshouty@cs.technion.ac.il)
Philip M. Long, Google, 1600 Amphitheatre Parkway, Mountain View, CA 94043 USA (plong@google.com)

Abstract

We describe an algorithm for clustering using a similarity graph. The algorithm (a) runs in O(n log³ n + m log n) time on graphs with n vertices and m edges, and (b) with high probability, finds all "large enough" clusters in a random graph generated according to the planted partition model. We provide lower bounds that imply that our "large enough" constraint cannot be improved much, even using a computationally unbounded algorithm. We describe some experiments running the algorithm and a few related algorithms on random graphs with partitions generated using a Chinese restaurant process, and some results of applying the algorithm to cluster DBLP titles.

1. Introduction

This work is aimed at using theoretical analysis to guide the design of large-scale clustering algorithms. We are especially interested in problems in which the purpose of clustering is to identify near-duplicates, for example to eliminate them from the index of a search engine. For problems like these, it is possible to use a technique such as minhashing (Cohen, 1997; Broder, 1997) to build a graph in which most pairs of items that should be clustered together are represented by edges, and there are not too many spurious edges. There can be billions, or more, of vertices. There are many clusters, and many vertices are not clustered. The largest clusters are the most important, since they do the most harm.
Because the input is large, and such duplicate detection must be done often, efficient algorithms are important. Even if massive parallelism is available to speed up algorithms, an algorithm that expends limited resources overall is still valuable as a means to reduce costs and make limited computing resources available for other purposes.

We analyze algorithms using the planted partition model (Condon & Karp, 2001; McSherry, 2001), in which an adversary partitions the vertices of a graph into clusters C1, ..., Ct, and then edges are included between members of the same cluster independently with one probability p, and edges are included between members of different clusters with another, smaller, probability q. We show that, if p ≥ 1/2, an O(n log³ n + m log n)-time algorithm can, with probability 1 − 1/poly(n), output a partition U1, U2, ..., Us that, for each input cell Ci of size Ω(max{qn, log n}), contains a cell Uj such that Uj = Ci. We also provide lower bounds that imply that this bound on the size of Ci cannot be improved much, even with unlimited computation.

Condon and Karp (2001) analyzed a nearly linear-time algorithm for recovering a planted partition in the case that the cells of the partition have the same size, and this size, and the number of cells, is known to the algorithm a priori. Their algorithm applies local perturbation to partition the vertices into two groups, and then applies the algorithm recursively to each of the two groups. They used the knowledge of the size of the cells to determine whether to make another recursive call or not.

Recent research has provided approximation guarantees for polynomial-time clustering algorithms using weaker assumptions than the planted partition model that we study here (Bansal et al., 2004; Kannan et al., 2004; Balcan et al., 2008), but using algorithms that require substantially more than linear time.

McSherry (2001) analyzed a spectral clustering algorithm using a model much like the planted partition model studied here. The aim of this work was to obtain similar guarantees with a faster algorithm. A number of authors have proposed to approximate the spectrum of similarity matrices for applications like this by different sampling schemes (Williams & Seeger, 2000; Frieze et al., 2004; Drineas & Mahoney, 2005; Kumar et al., 2009). Achlioptas and McSherry (2007) provided a theoretical analysis of the approximation capabilities of one such scheme. However the approximation is performed, when there are k clusters, the resulting spectral clustering algorithms appear to take Ω(k²) time. In contrast, the Subsquare algorithm takes time nearly linear in the number of edges, no matter how many clusters there are, which appears to make it more suitable for applications like near-duplicate detection. This is supported by some experiments: we tried out an algorithm obtained by using a sampling scheme like the one analyzed by Achlioptas and McSherry in concert with the spectral clustering algorithm proposed by Ng, Jordan and Weiss (2001).

Bansal et al. (2004), in Section 6 of their paper, described an algorithm based on neighborhood overlap, together with a sketch of an analysis in the case in which edges between members of the same cluster are included with probability 1. Gibson et al. (2005) described a somewhat complicated algorithm that starts by applying minhashing to vertex neighborhoods in an effort to quickly find vertices with similar neighborhoods.

The Subsquare algorithm proposed in this paper can be viewed as an approximation to more computationally intensive spectral algorithms. Roughly, spectral algorithms like to cluster together a pair (v, w) of vertices if a random walk from v is likely to land at w (von Luxburg, 2007). We show that, to obtain guarantees for the planted partition model, it is sufficient to examine walks of length 2, or, in other words, to consider the extent to which the neighborhoods of pairs of vertices overlap. However, simply squaring the adjacency matrix, even in combination with random sparsification as in (Achlioptas & McSherry, 2007), appears not to work: if we sparsify sufficiently to achieve nearly linear time, then the neighborhoods of vertices in large clusters will not overlap enough to be detected. When deciding whether to cluster together two vertices, instead of considering the overlap between random subsamples of both of their neighborhoods, Subsquare checks how many members of a subsample of one vertex's neighbors are in the list of all of the other vertex's neighbors.

2. The Problem and Results

2.1. Planted partition model

We will analyze algorithms using a slight variant of the planted partition model (Jerrum & Sorkin, 1998) (see also (Condon & Karp, 2001; Bui et al., 1987; McSherry, 2001)). Let G = G(V, E) be a random graph with the set of vertices V = {v1, ..., vn} and set of edges E that is built as follows: (1) Choose an integer t and arbitrary disjoint subsets C1, ..., Ct ⊆ V where C1 ∪ C2 ∪ ··· ∪ Ct = V; (2) For every i ≥ 1 and every v, w ∈ Ci, add the edge (v, w) to E with probability p; (3) For every 1 ≤ i < j ≤ t, v ∈ Ci and w ∈ Cj, add the edge (v, w) to E with probability q. The choices of whether to add the edges are mutually independent. We call a random graph generated this way a (p, q)-clustered graph. The sets C1, ..., Ct are called clusters. We can think of clusters of size 1 as unclustered vertices. For any vertex v, let C(v) be the cluster containing v, and let N(G, v) be the set of neighbors of v in G. We will also assume that p ≥ 1/2 (roughly, that a majority of good edges are present). Let m be the number of edges in the input graph, and n be the number of vertices.
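For concreteness, the following is a minimal sketch (not from the paper) of one way to generate a (p, q)-clustered graph as just described; the function name and the choice of an adjacency-set representation are our own.

```python
import random

def planted_partition_graph(clusters, p, q, rng=random.Random(0)):
    """Generate a (p, q)-clustered graph.

    clusters: list of lists of vertex ids, forming the partition C1, ..., Ct.
    Returns an adjacency map: vertex -> set of neighbors.
    (Quadratic in n; fine for a sketch, not for billions of vertices.)
    """
    vertices = [v for C in clusters for v in C]
    cluster_of = {v: i for i, C in enumerate(clusters) for v in C}
    adj = {v: set() for v in vertices}
    for a in range(len(vertices)):
        for b in range(a + 1, len(vertices)):
            v, w = vertices[a], vertices[b]
            # Same cluster: include edge with probability p; otherwise q.
            prob = p if cluster_of[v] == cluster_of[w] else q
            if rng.random() < prob:
                adj[v].add(w)
                adj[w].add(v)
    return adj

# Example: adj = planted_partition_graph([[0, 1, 2], [3, 4, 5]], p=0.9, q=0.1)
```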
2.2. The Algorithm

The algorithm Subsquare makes its clustering decisions based on sampling estimates of some entries of the square of the adjacency matrix of G. The algorithm has three parameters, c0, c1 and ε, which will be determined in the analysis. Details are as follows (Figure 1):

• Choose a random bijection σ : V → {1, ..., n} to order the vertices.
• Make two passes over V in the order given by σ, performing the following steps for each vertex v:
  1. If v has fewer than c0 log(n/ε) neighbors, assign v to its own cluster and go to the next v.
  2. Let Rtemp consist of the neighbors of v that have already been assigned to clusters.
  3. Form R by independently including each element of Rtemp with probability min{c0 log(n/ε)/|Rtemp|, 1}.
  4. Choose S by independently including each member of N(G, v) with probability c0 log(n/ε)/|N(G, v)|.
  5. Initialize the set D of candidate clusters for v to ∅.
  6. For each w ∈ R, if
     (a) |S ∩ N(G, w)| ≥ c1 log(n/ε),
     (b) |N(G, w)| ≥ c0 log(n/ε), and
     (c) the set Sw, obtained by sampling each u ∈ N(G, w) with probability c0 log(n/ε)/|N(G, w)|, satisfies |Sw ∩ N(G, v)| ≥ c1 log(n/ε),
     add Ĉ(w) (i.e., w's cluster) to D.
  7. If D is not empty, set Ĉ(v) to Ĉ(argmin over w′ with Ĉ(w′) ∈ D of σ(w′)); otherwise, start a new cluster consisting only of v.

Figure 1. Algorithm Subsquare.
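To make the control flow concrete, here is a minimal Python sketch of Subsquare as described in Figure 1. This is our own illustrative rendering rather than the authors' implementation: the helper names are hypothetical, it consumes the vertex-to-neighbor-set map used in the generation sketch of Section 2.1, and ties in Step 7 are resolved by tracking the earliest-ordered member (the "head") of each cluster.

```python
import math
import random

def subsquare(adj, c0, c1, eps, rng=random.Random(0)):
    """Cluster the graph adj (vertex -> set of neighbors); return vertex -> cluster id."""
    n = len(adj)
    thresh0 = c0 * math.log(n / eps)   # c0 log(n/eps)
    thresh1 = c1 * math.log(n / eps)   # c1 log(n/eps)

    order = list(adj)                  # random bijection sigma
    rng.shuffle(order)
    sigma = {v: i for i, v in enumerate(order)}

    cluster = {}                       # current assignment, C_hat
    head_rank = {}                     # cluster id -> smallest sigma among its members
    next_id = 0

    def sample(population, prob):
        prob = min(prob, 1.0)
        return [u for u in population if rng.random() < prob]

    for _pass in range(2):             # two passes in the order given by sigma
        for v in order:
            if len(adj[v]) < thresh0:  # Step 1: too few neighbors, own cluster
                cluster[v] = next_id
                head_rank[next_id] = sigma[v]
                next_id += 1
                continue
            r_temp = [w for w in adj[v] if w in cluster]          # Step 2
            R = sample(r_temp, thresh0 / len(r_temp)) if r_temp else []  # Step 3
            S = set(sample(adj[v], thresh0 / len(adj[v])))        # Step 4
            D = set()                                             # Step 5
            for w in R:                                           # Step 6: edge tests
                if len(S & adj[w]) < thresh1:
                    continue
                if len(adj[w]) < thresh0:
                    continue
                Sw = set(sample(adj[w], thresh0 / len(adj[w])))
                if len(Sw & adj[v]) < thresh1:
                    continue
                D.add(cluster[w])
            if D:                                                 # Step 7
                best = min(D, key=lambda cid: head_rank[cid])     # cluster with earliest head
                cluster[v] = best
                head_rank[best] = min(head_rank[best], sigma[v])
            else:
                cluster[v] = next_id
                head_rank[next_id] = sigma[v]
                next_id += 1
    return cluster
```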
2.3. Main results

Theorem 1 There are constants c, ε0, m0 and n0 such that, for all p ∈ [1/2, 1], q ∈ [0, 1], m ≥ m0, n ≥ n0, and 0 < ε ≤ ε0, if G is a (p, q)-clustered graph with clusters C1, ..., Ct, then, with probability at least 1 − ε, Algorithm Subsquare, when run with c0 = c/2 and c1 = c/2^5, outputs a partition U1, ..., Us of the vertices of G such that, for all cluster indices i such that

  |Ci| ≥ c · max{qn, log(n/ε)}     (1)

there is an output cluster index j such that Ci = Uj.

We will show that a constraint like (1) is necessary. Note that if all Ci satisfy (1), then Subsquare is guaranteed to output the correct clustering with high probability.

Theorem 2 The expected running time of Subsquare is O(n log²(n/ε) log n + m log n).

3. The analysis

Our proof of Theorem 1 is broken up into a series of lemmas. Throughout this analysis, we will follow the convention of using "w.h.p." as a shorthand for "with probability 1 − ε/poly(n)". As we will see, polynomially many events will be bounded with this probability, so that, with probability at least 1 − ε, all of them hold. We will show that, for any polynomial in this confidence bound, some values of the constants in our analysis will work.

Definition 3 Refer to the comparisons performed for a particular value of w in Step 6 of Subsquare as an edge test. Say that the test succeeds if C(v) = C(w) and Ĉ(w) is put into D, or if C(v) ≠ C(w) and Ĉ(w) is not put into D.

Note that, to put Ĉ(w) into D, Subsquare requires both that |S ∩ N(G, w)| ≥ c1 log(n/ε) and that |Sw ∩ N(G, v)| ≥ c1 log(n/ε).

Lemma 4 W.h.p., for any vertex v whose cluster C(v) satisfies (1), at least half of v's neighbors are in C(v).

Proof: Since edges between vertices from distinct clusters are included with probability q, we have E(|N(G, v) − C(v)|) ≤ qn ≤ |C(v)|/16, by (1), as long as c ≥ 16. But, since |C(v)| ≥ c log(n/ε), the standard Chernoff bound (see Theorem 4.3 of (Motwani & Raghavan, 1995)) implies that, w.h.p., |N(G, v) − C(v)| ≤ |C(v)|/8. On the other hand, since p ≥ 1/2, we have E(|N(G, v) ∩ C(v)|) ≥ (|C(v)| − 1)/2, and, since |C(v)| ≥ c log n, this implies that, w.h.p., |N(G, v) ∩ C(v)| ≥ (|C(v)| − 1)/4 ≥ |C(v)|/8.
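For reference, one standard form of the multiplicative Chernoff bound of the kind invoked repeatedly in this section (we state a generic version rather than the exact numbered theorem of Motwani & Raghavan): for a sum X of independent {0,1}-valued random variables with mean μ = E[X],

\[
\Pr[X \ge (1+\delta)\mu] \le e^{-\delta^2\mu/3},
\qquad
\Pr[X \le (1-\delta)\mu] \le e^{-\delta^2\mu/2},
\qquad 0 < \delta \le 1 .
\]

With μ = Θ(log(n/ε)), as in Lemma 4, the right-hand sides are (ε/n)^Θ(1), which is how the "w.h.p." statements arise.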
Lemma 5 For a pair {v, w} of vertices, if C(v) = C(w), and the cluster satisfies (1), and an edge test is conducted for v and w, then, w.h.p., this edge test succeeds.

Proof: First, we have that, w.h.p.,

  |S ∩ C(v)| ≥ (c0/4) log(n/ε)     (2)

since Lemma 4 implies that, w.h.p., |N(G, v) ∩ C(v)| ≥ |N(G, v)|/2, which in turn implies E(|S ∩ C(v)|) ≥ (c0 log(n/ε)/|N(G, v)|) · (|N(G, v)|/2), from which (2) follows by a Chernoff bound. Now, conditioned on a fixed value of S, which depends only on the random neighbors of v and the randomization of the algorithm, the events that the members of S ∩ C(v) are neighbors of w are independent. Given (2), we have E(|S ∩ N(G, w)|) ≥ E(|S ∩ C(v) ∩ N(G, w)|) ≥ |S ∩ C(v)|/2, since p ≥ 1/2. Thus, a Chernoff bound implies that, w.h.p., |S ∩ N(G, w)| ≥ c1 log(n/ε). Arguing analogously, we get that, w.h.p., |Sw ∩ N(G, v)| ≥ c1 log(n/ε), and therefore the lemma holds.

Lemma 6 For any pair {v, w} of vertices, if C(v) satisfies (1), and an edge test is conducted between v and w, and C(v) ≠ C(w), then w.h.p. the edge test is successful.

Proof: Because the edge test is symmetrical, we may assume without loss of generality that it takes place during v's turn. Thus, it suffices to show that, w.h.p., |S ∩ N(G, w)| < c1 log(n/ε). For each u, let Xu be the random variable that indicates whether u ∈ N(G, v) ∩ N(G, w). Suppose that {Yu : u ∈ V} are {0,1}-valued random variables such that Pr(Yu = 1) = min{c0 log(n/ε)/|N(G, v)|, 1} for all u, and that are mutually independent. We can think of the variables Yu as an extension to all of V of the random variables that were used to decide whether to include each member of N(G, v) in S. We claim that, w.h.p.,

  Σ_{u ∈ N(G,v)} Yu ≥ (c0/2) log(n/ε).     (3)

If |N(G, v)| ≤ c0 log(n/ε), this is because in this case Pr(Yu = 1) = 1 for all u, and therefore Σ_{u ∈ N(G,v)} Yu = |N(G, v)|. Otherwise, it follows from a Chernoff bound, together with the fact that E(Σ_{u ∈ N(G,v)} Yu) = c0 log(n/ε). Since S is a subset of the vertices for which Yu = 1, we have that |S ∩ N(G, w)| ≤ Σ_{u ∈ V−{v,w}} Xu Yu. Note that, though {Xu Yu : u ∈ V − {v, w}} might not be independent, after we condition on the values of {Xu : u ∈ V − {v, w}}, then they are. We will first obtain a high-probability bound on Σu Xu, and then use it to get a bound on Σu Xu Yu that holds with high probability. For each u ∈ V − {v, w}, either u ∉ C(v) or u ∉ C(w), so Pr(u ∈ N(G, v) ∩ N(G, w)) ≤ pq. Thus E(Σu Xu) ≤ pqn ≤ qn ≤ |C(v)|/2^9 by (1), so long as c ≥ 2^9. Note that each Xu is determined solely by the random generation of the graph, thus the various Xu variables are mutually independent. Applying a Chernoff bound, w.h.p., Σu Xu ≤ |C(v)|/2^8. Since p ≥ 1/2, w.h.p., |N(G, v)| ≥ |C(v)|/4, so

  Σu Xu ≤ |N(G, v)|/2^6.     (4)

Since for each u, Pr(Yu = 1) ≤ c0 log(n/ε)/|N(G, v)|, we have that, conditioned on the event that (4) holds, E(Σu Xu Yu) ≤ (c0/2^5) log(n/ε) ≤ (c1/2) log(n/ε). Applying another Chernoff bound, together with the fact that the Yu's are (conditionally) independent, completes the proof.

Lemma 7 For any C that satisfies (1), for all v ∈ C and all vertices w, with high probability, if Ĉ(v) = Ĉ(w) then C(v) = C(w).

Proof: This follows from induction using Lemma 6.

Definition 8 The first member of a cluster C encountered by Subsquare is called the head of C.

Lemma 9 During the first pass through the vertices, for any C that satisfies (1), for any vertex v with an edge to the head vC of C, w.h.p., Ĉ(v) = Ĉ(vC).

Proof: The proof is by induction on σ(v). If |Rtemp| ≤ c0 log(n/ε), then vC ∈ R and the lemma follows from Lemma 5. Suppose |Rtemp| > c0 log(n/ε). Since p ≥ 1/2, w.h.p., |Rtemp ∩ N(G, vC)| > |Rtemp|/4. Thus, E(|R ∩ N(G, vC)|) ≥ (c0/4) log(n/ε), and, therefore, w.h.p., R ∩ N(G, vC) ≠ ∅, and the lemma follows from Lemma 5 together with the inductive hypothesis, together with the fact that Subsquare assigns v to the cluster in D with the least head node.

Lemma 10 During the second pass through the vertices, for any C that satisfies (1), for any v ∈ C, Ĉ(v) = Ĉ(vC).

Proof: Since p ≥ 1/2, E(|N(G, v) ∩ N(G, vC)|) ≥ |N(G, v)|/2. During the second pass, Rtemp = N(G, v); thus, w.h.p., R ∩ N(G, vC) ≠ ∅. By Lemma 9, though, for any w ∈ N(G, vC), after the first pass, w.h.p., Ĉ(w) = Ĉ(vC). So, w.h.p., some w with Ĉ(w) = Ĉ(vC) will be in R. The lemma then follows from Lemma 5.

Now that we have finished the proof of Theorem 1, we turn to Theorem 2, the analysis of the running time.

Proof of Theorem 2: In O(m log n) time, Subsquare can create a data structure that will enable it to test membership in N(G, v) for any vertex v in O(log n) time. This can be done by creating a balanced binary tree for each vertex v with the neighbors of v on the leaves. These "neighbor trees" can be found if there is a higher-level balanced binary tree that has a pointer to a neighbor tree for each vertex v at each leaf. We can then see that, for each vertex, at most O(log(n/ε)) candidate clusters are examined, and each candidate requires neighbor lookups for O(log(n/ε)) neighbors, each requiring O(log n) time.
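The proof above only needs comparison-based O(log n) membership tests. A minimal sketch of one way to support them, using sorted adjacency arrays and binary search rather than explicit balanced trees (asymptotically the same), might look like the following; the function names are our own, and in practice a Python set would give expected O(1) lookups anyway.

```python
import bisect

def build_neighbor_index(adj):
    """Sort each adjacency list once; O(m log n) total preprocessing."""
    return {v: sorted(neighbors) for v, neighbors in adj.items()}

def is_neighbor(index, v, u):
    """Test whether u is in N(G, v) in O(log n) time via binary search."""
    lst = index[v]
    i = bisect.bisect_left(lst, u)
    return i < len(lst) and lst[i] == u
```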
4. Lower bounds

Theorem 11 For any function φ such that φ(n) = o(log n), there is no algorithm A with the following property: There are constants c, m0 and n0 such that, for all p ∈ [1/2, 1], q ∈ [0, 1], m ≥ m0, and n ≥ n0, if G is a (p, q)-clustered graph with clusters C1, ..., Ct, then, with probability at least 1 − 1/8, Algorithm A outputs a partition U1, ..., Us of the vertices of G such that, for all cluster indices i such that |Ci| ≥ c · max{qn, φ(n)}, there is an output cluster index j such that Ci = Uj.

Proof Sketch: Suppose p = 1/2 and q = 0. Let t = n/(c·φ(n)). We will show that, for a randomly chosen partition C1, ..., Ct, Algorithm A fails with probability at least 1/8, which will imply the existence of a partition C1, ..., Ct with this property. Suppose that the random partition over the vertices v1, ..., vn is chosen by assigning vi to Ci for i ≤ t, and, for i > t, assigning each vertex independently to a random cluster. Let C be this clustering. Suppose further that, if any vertex v ∉ {v1, ..., vt} (i.e., a vertex whose cluster assignment was chosen randomly) is not incident on any edges, then Algorithm A is given the cluster assignments of all vertices other than v. A lower bound under this assumption implies a lower bound for the original clustering problem, since A has the option to ignore the additional information.

If C′ = (C′1, ..., C′t) consists of the cluster assignments to the vertices other than v, we have Pr(C(v) = j | G) = Pr(C(v) = j | N(G, v), C′). Further, applying Bayes' rule, it is possible to show that

  Pr(C(v) = j | N(G, v) = ∅, C′) = 2^(−|C′j|) / (t · Pr(N(G, v) = ∅ | C′)).

Thus, given that N(G, v) = ∅, together with the knowledge of the cluster assignments other than v, a Bayes-optimal algorithm chooses arbitrarily from among the clusters of smallest size, and, if ni = |C′i|, then the probability that the Bayes-optimal algorithm makes an error on v is

  1 − (max_j 2^(−nj)) / (Σ_{j=1}^{t} 2^(−nj)).

We claim that, for large enough n, with probability at least 1/2, the cluster assignments to the vertices other than v leave at least two clusters with one member each. Let B1, ..., Bt be indicator variables for whether the corresponding clusters in C′1, ..., C′t have exactly one member. B1, ..., Bt are negatively associated in the sense of (Joag-Dev & Proschan, 1983) (see (Dubhashi & Ranjan, 1998)). For each i, E(Bi) = (1 − 1/t)^(n−t−1). Since negatively associated random variables obey Chernoff bounds (see (Dubhashi & Ranjan, 1998)), we have

  Pr(Σi Bi < 2) ≤ exp(−(t/8)(1 − 1/t)^(n−t−1)) = exp(−n(1 − 1/t)^(n−t−1)/(8cφ(n))) = exp(−n·exp(−Θ(φ(n)))) = exp(−n^(1−o(1))).

Thus, for large enough n, with probability at least 1/2, we have that, for distinct i and j, ni = nj = 1, and thus the conditional probability that A makes a mistake, given that N(G, v) = ∅, is at least

  1 − 1/(2 + Σ_{i=3}^{t} 2^(−ni)) ≥ 1/2.

Now let us compute a lower bound on the probability that some vertex is disconnected. At least t/2 of the clusters have at most 2n/t elements. Let us call these clusters light. Let us lower bound the probability of an isolated vertex by the probability of an isolated vertex in a light cluster. For some such cluster, let ℓ be the number of elements in the cluster. Any individual vertex in the cluster is disconnected from the rest with probability at least 2^(−ℓ), so the probability that the cluster is not connected is at least this large. Since the edges in different clusters are chosen independently, the probability that some light cluster is disconnected is therefore at least 1 − (1 − 2^(−2n/t))^(t/2), which is at least 1 − exp(−(t/2)·2^(−2n/t)). Now,

  lim_{n→∞} (t/2)·2^(−2n/t) = lim_{n→∞} (n/(2cφ(n)))·2^(−2cφ(n)) = lim_{n→∞} exp(ln n − o(ln n)) = ∞.

Thus, for large enough n, the probability that some cluster is disconnected is at least 1/2, so that the probability of error is at least 1/8.

Theorem 12 For any function φ such that φ(n) = o(n), there is no algorithm A with the following property: There are constants c, m0 and n0 such that, for all p ∈ [1/2, 1], q ∈ [0, 1], m ≥ m0, and n ≥ n0, if G is a (p, q)-clustered graph with clusters C1, ..., Ct, then, with probability at least 3/4, Algorithm A outputs a partition U1, ..., Us of the vertices of G such that, for all cluster indices i such that |Ci| ≥ c · max{q·φ(n), log n}, there is an output cluster index j such that Ci = Uj.

Proof: Let t = n/φ(n) and p = q = 1/2. Note that, since φ(n) = o(n), if n is large enough, t ≥ 2. Suppose C = (C1, C2, ..., Ct) is obtained by picking an arbitrary vertex v, randomly partitioning the remaining vertices into equal-sized clusters, and then randomly assigning v to one of those clusters. Since the edge probabilities are unaffected by the cluster assignments, an algorithm A that makes a random cluster assignment for v is Bayes optimal, and therefore its probability of error is at least 1 − 1/t ≥ 1/2. Since this is true on average for a randomly chosen clustering C, there exists a clustering C such that A has probability at least 1/2 of making an error for C.
5. Experiments

5.1. Modifications to Subsquare

We performed our experiments with a slightly modified version of Subsquare. First, instead of independently putting each vertex of Rtemp in R with probability c0 log(n/ε)/|Rtemp|, the modified algorithm sets R to be a random subset of Rtemp of size at most c0 log(n/ε), and it chooses S similarly. Second, in Step 6 of Figure 1, instead of requiring that w has at least c0 log(n/ε) neighbors, the revised algorithm estimates the probability that a random neighbor of w is also a neighbor of v by dividing the number of neighbors of w found in v's adjacency list by one more than the number of neighbors tried. Third, when computing the above statistics, the counts for neighbors w of v that have been assigned to the same cluster Ĉ(w) are consolidated, so that the algorithm computes an overall estimate p̂(C, v) of the probability that a random neighbor of v is in the adjacency list of a random vertex that has been assigned to cluster C. Vertex v is then assigned to the cluster, from among those for which p̂(C, v) is at least a threshold (whose value is given with the experiments below), which contained the most of the neighbors w that were examined during v's turn.

5.2. Other Algorithms

We compared Subsquare with three other algorithms. The first, inspired by the analysis of Achlioptas and McSherry (2007), randomly samples edges with probability p, and then applies the spectral clustering algorithm of Ng, Jordan and Weiss (2001). The NJW algorithm requires the user to specify the number k of clusters. We computed the top eigenvectors using the iterative block power method (Strang, 1986), which has the number of iterations as an adjustable parameter. The NJW algorithm also calls for the application of k-means, which also has a variable number of iterations. We experimented with various settings of these parameters, as described below.

The second is an algorithm that performs minhashing (Cohen, 1997; Broder, 1997), and forms a cluster from each bucket. This can be equivalently described as an algorithm that processes the vertices in random order, and, if a vertex v has not been assigned a cluster when it is encountered, forms a new cluster from v and its unclustered neighbors (a short sketch of this baseline appears at the end of this subsection).

The third is version 9.308 of MCL (van Dongen, 2000; Enright et al., 2002). The key parameter for MCL, which trades between precision and recall, is called "expansion". We found that, on this data, the default value of 2 optimized excessively for precision, so we tried 1.5, 2, and 3, which roughly covered the range of values suggested on the MCL website. To save space, we report only the best result, for 1.5, below (the other results are significantly worse). While Subsquare and the minhash algorithm were coded in Python in the most straightforward way, the MCL implementation is optimized C code.
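For illustration, here is a minimal sketch of the second baseline in the equivalent vertex-ordering form described above; this is our own rendering and the function name is hypothetical.

```python
import random

def minhash_bucket_clustering(adj, rng=random.Random(0)):
    """Greedy baseline: scan vertices in random order; each still-unclustered
    vertex starts a new cluster containing itself and its unclustered neighbors."""
    order = list(adj)
    rng.shuffle(order)
    cluster = {}
    next_id = 0
    for v in order:
        if v in cluster:
            continue
        cluster[v] = next_id
        for w in adj[v]:
            if w not in cluster:
                cluster[w] = next_id
        next_id += 1
    return cluster
```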
5.3. Chinese Restaurant Process Data

To get large-scale data with known cluster assignments and some variety in cluster sizes, we generated data by first partitioning the vertices using a Chinese restaurant process (Pitman, 1995; Teh, 2010), with its two parameters set to 1 and −20/n. This tends to generate clusters of size roughly on the order of 20 vertices, which is similar to the most challenging applications targeted in this work. We generated graphs with n = 10000, 20000, 50000, and 100000 vertices. We included edges between vertices in the same cell with probability 1/2, and then added random edges until the number of such "noisy" edges was equal to the number of "clean" edges.
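As an illustration, the following is a generic sketch of partitioning vertices with a Chinese restaurant process. The exact two-parameter setting used in the paper (the values 1 and −20/n above) is not fully recoverable from the text, so this sketch exposes a single concentration parameter alpha and should be read as an assumption-laden stand-in rather than the authors' generator; the resulting cluster list can be fed to a graph generator such as the sketch in Section 2.1.

```python
import random

def crp_partition(n, alpha, rng=random.Random(0)):
    """Partition vertices 0..n-1 with a one-parameter Chinese restaurant process.

    Vertex i joins an existing cluster C with probability |C| / (i + alpha)
    and starts a new cluster with probability alpha / (i + alpha).
    """
    clusters = []
    for i in range(n):
        r = rng.random() * (i + alpha)
        acc = 0.0
        chosen = None
        for C in clusters:
            acc += len(C)
            if r < acc:
                chosen = C
                break
        if chosen is None:        # start a new cluster
            chosen = []
            clusters.append(chosen)
        chosen.append(i)
    return clusters

# Example: clusters = crp_partition(10000, alpha=5.0)
```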
For each such dataset we ran Subsquare and the minhash algorithm 10 times, and averaged the precision and recall obtained. (When computing precision and recall, we interpret a clustering as a series of predictions of whether each pair of vertices is in the same cluster.) We also recorded the time taken by each algorithm. We stopped the MCL algorithm after it ran for 1000 minutes on the two larger datasets. We ran Subsquare with ε and c1 chosen so that c1 log(n/ε) = 100, and with the threshold described in Section 5.1 chosen to be 0.05. Results are as follows (time is in minutes):

       n    Minhash (time, F)    Subsquare (time, F)    MCL (time, F)
   10000    1, 0.24              3, 0.97                47, 0.96
   20000    1, 0.24              6, 0.98                175, 0.97
   50000    1, 0.24              13, 0.99               >1000, --
  100000    3, 0.24              28, 0.99               >1000, --

The time required by Subsquare is seen to scale linearly with the number of vertices. Subsquare achieves better accuracy than MCL while requiring an order of magnitude less time. The time required by MCL appears to be scaling superlinearly.

For the sampling spectral algorithm, we tried (a) sampling each edge with various probabilities p: 0.05, 0.1, 0.2; (b) setting the number of clusters to various values s: 100, 200, 500 (since the expected size of each cluster is 20, the true number is approximately 500); (c) running the block power method for various numbers of iterations Ipow: 10, 25, and 100; and (d) running k-means for various numbers of iterations Ik: 2, 5, and 10. We tried all 81 combinations of these parameters on the n = 10000 dataset. After eliminating the combinations that required more than 1000 minutes to run, the five combinations of parameters with the best values of the F-score are shown below, along with the Subsquare result.

        p      s    Ipow    Ik    time    F-score
      0.2    200      25    10     503       0.59
      0.2    200      10    10     231       0.55
      0.2    200      10     5     212       0.53
      0.2    200      25     5     466       0.52
      0.2    100     100    10     920       0.40
  Subsquare                          3       0.97

5.4. Bi-holdout experiments

To evaluate algorithms on the DBLP data, we used a single training-test split variant of bi-cross-validation (Gabriel, 2002; Owen & Perry, 2009), which might be called a bi-holdout experiment. For graph-based clustering, the most natural application of bi-cross-validation appears to be to repeatedly apply the following steps: (a) Randomly split the vertices into training vertices U and test vertices W; (b) Run each algorithm on the subgraph induced by U, called the "training graph"; (c) For each test vertex w ∈ W, determine the cluster assignment of w by using the assignment made to a random neighbor of w in U (the edges with one vertex in W and one vertex in U, which might be called the "bridge graph", are used for this); (d) For each pair of vertices in W, predict that there is an edge between them if and only if they are assigned to the same cluster in the preceding step; (e) Compare the predictions with the actual presence or absence of edges between vertices in W (these edges constitute the "test graph"). We only used a single training-test split because our graphs are large. A fraction 1/10 of the vertices were placed in the test graph.

We first used the bi-holdout estimate to compare the algorithms on the synthetic data described above. We applied both Subsquare and the minhash algorithm to the graph in the n = 100000 case, and compared the results using this protocol. The Subsquare algorithm achieved an F-score of 0.136, where the minhash algorithm obtained 0.046. While Subsquare still performs better, the edge-prediction accuracy obtained is much smaller than the cluster-membership accuracy that was measured earlier. This is not surprising, in light of the fact that there are three chances for a correct determination that two vertices u and v in the test graph should be clustered together to fail to be recognized by this protocol: (1) the edge between u and v may be missing from the graph, (2) a spurious edge from u to the training graph may be chosen, and (3) a spurious edge from v to the training graph may be chosen. Nevertheless, since these should be expected to affect all algorithms equally, the statistics obtained through this protocol should still give an idea of the relative merit of clustering algorithms.

Next, we performed a similar experiment using the titles in the DBLP database. We used a variant of minhashing (Cohen, 1997; Cohen et al., 2001; Li et al., 2008) to compute the similarity graph (a sketch of this construction appears at the end of this subsection). Each title was summarized using a bag of its length-4 substrings (called "k-mers", with k = 4). Next, we removed from all bags all k-mers that appeared in at least 1000 titles. We then fixed a random permutation of the k-mers, and further summarized each title by a list of the first 25 k-mers, in sorted order according to this permutation, appearing in its bag. Titles whose resulting "sketches" shared at least 5 k-mers were deemed similar and given an edge in the graph. The resulting graph had roughly 1.5 million edges, and was clustered by Subsquare on a four-year-old desktop workstation in less than 25 minutes. The bi-holdout F-score for Subsquare (here with ε and c1 chosen so that c1 log(n/ε) = 100 and the threshold set to 0.9) was 0.504, where the minhash algorithm obtained 0.471.
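To make the DBLP graph construction concrete, here is a minimal sketch of the sketching step described above. This is our own illustrative code; the thresholds mirror the values in the text, while the function name, tokenization details, and the quadratic pair comparison (a real implementation would bucket titles by shared k-mers) are assumptions.

```python
import random
from collections import Counter
from itertools import combinations

def build_title_graph(titles, k=4, max_df=1000, sketch_size=25, min_shared=5,
                      rng=random.Random(0)):
    """Connect titles whose minhash-style sketches share >= min_shared k-mers."""
    bags = [{t[i:i + k] for i in range(len(t) - k + 1)} for t in titles]
    # Drop k-mers that occur in at least max_df titles.
    df = Counter(kmer for bag in bags for kmer in bag)
    bags = [{kmer for kmer in bag if df[kmer] < max_df} for bag in bags]
    # Fix one random permutation (rank) of all surviving k-mers.
    vocab = sorted({kmer for bag in bags for kmer in bag})
    rank = {kmer: r for kmer, r in zip(vocab, rng.sample(range(len(vocab)), len(vocab)))}
    # Each title keeps the sketch_size k-mers that come first under the permutation.
    sketches = [set(sorted(bag, key=rank.get)[:sketch_size]) for bag in bags]
    adj = {i: set() for i in range(len(titles))}
    for i, j in combinations(range(len(titles)), 2):
        if len(sketches[i] & sketches[j]) >= min_shared:
            adj[i].add(j)
            adj[j].add(i)
    return adj
```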
References

Achlioptas, D. and McSherry, F. Fast computation of low-rank matrix approximations. JACM, 54(2), 2007.

Balcan, M., Blum, A., and Vempala, S. A discriminative framework for clustering via similarity functions. STOC, 2008.

Bansal, N., Blum, A., and Chawla, S. Correlation clustering. Machine Learning, 56(1-3):89-113, 2004.

Broder, A. Z. On the resemblance and containment of documents. SEQUENCES, pp. 21-29, 1997.

Bui, T. N., Leighton, F. T., Chaudhuri, S., and Sipser, M. Graph bisection algorithms with good average case behavior. Combinatorica, 7(2):171-191, 1987.

Cohen, E. Size-estimation framework with applications to transitive closure and reachability. JCSS, 55(3):441-453, 1997.

Cohen, E., Datar, M., Fujiwara, S., Gionis, A., Indyk, P., Motwani, R., Ullman, J. D., and Yang, C. Finding interesting associations without support pruning. IEEE Trans. Knowl. Data Eng., 13(1):64-78, 2001.

Condon, A. E. and Karp, R. M. Algorithms for graph partitioning on the planted partition model. Random Structures and Algorithms, 18(2):221-232, 2001.

Drineas, P. and Mahoney, M. W. Approximating a Gram matrix for improved kernel-based learning. COLT, 2005.

Dubhashi, D. and Ranjan, D. Balls and bins: A study in negative dependence. Random Structures & Algorithms, 13(2):99-124, 1998.

Enright, A. J., Van Dongen, S., and Ouzounis, C. A. An efficient algorithm for large-scale detection of protein families. Nucl. Acids Res., 30(7):1575-1584, 2002.

Frieze, A., Kannan, R., and Vempala, S. Fast Monte-Carlo algorithms for finding low-rank approximations. JACM, 51(6):1025-1041, 2004.

Gabriel, K. Le biplot - outil d'exploration de données multidimensionnelles. Journal de la Société Française de Statistique, 143:5-55, 2002.

Gibson, D., Kumar, R., and Tomkins, A. Discovering large dense subgraphs in massive graphs. VLDB, 2005.

Jerrum, M. and Sorkin, G. B. The Metropolis algorithm for graph bisection. Discrete Applied Mathematics, 82(1-3):155-175, 1998.

Joag-Dev, K. and Proschan, F. Negative association of random variables, with applications. The Annals of Statistics, 11(1):286-295, 1983.

Kannan, R., Vempala, S., and Vetta, A. On clusterings: Good, bad and spectral. J. ACM, 51(3):497-515, 2004.

Kumar, S., Mohri, M., and Talwalkar, A. On sampling-based approximate spectral decomposition. ICML, 2009.

Li, P., Church, K. W., and Hastie, T. L. One sketch for all: theory and application of conditional random sampling. NIPS, 2008.

McSherry, F. Spectral partitioning of random graphs. FOCS, 2001.

Motwani, R. and Raghavan, P. Randomized Algorithms. Cambridge University Press, 1995.

Ng, A. Y., Jordan, M. I., and Weiss, Y. On spectral clustering: Analysis and an algorithm. NIPS, 2001.

Owen, A. B. and Perry, P. O. Bi-cross-validation of the SVD and nonnegative matrix factorization. The Annals of Applied Statistics, 3(2):564-594, 2009.

Pitman, J. Exchangeable and partially exchangeable random partitions. Probab. Theory Relat. Fields, 102:145-158, 1995.

Strang, G. Linear Algebra and its Applications, third edition. Saunders College, 1986.

Teh, Y. W. Dirichlet processes. In Encyclopedia of Machine Learning. 2010.

van Dongen, S. Graph Clustering by Flow Simulation. PhD thesis, University of Utrecht, 2000.

von Luxburg, U. A tutorial on spectral clustering. Statistics and Computing, 17(4):395-416, 2007.

Williams, C. K. I. and Seeger, M. Using the Nyström method to speed up kernel machines. NIPS, 2000.