WWW 2007 / Track: E*-Applications Session: E-Commerce and E-Content DETECTIVES: DETEcting Coalition hiT Inflation attacks in adVertising nEtworks Streams Ahmed Metwally Divyakant Agrawal Amr El Abbadi Depar tment of Computer Science, University of California at Santa Barbara Santa Barbara, CA 93106 {metwally, agrawal, amr}@cs.ucsb.edu 1. INTRODUCTION Internet advertising flourishes as the ideal choice for b oth small and large businesses to target their marketing campaigns to the appropriate customer b ody on the fly. An Internet advertiser, e.g. ebay, provides an advertising commissioner, e.g. ValueClick, with its advertisements, allocates a budget, and sets a commission for each customer action, such as clicking an advertisement, bidding in an auction, or making a purchase. The Internet publishers, e.g. myspace.com, motivated by the commission paid by the advertisers, contract with the commissioner to display advertisements on their Web sites. The main orchestrators in this setting are the commissioners, who are the brokers b etween publishers and advertisers, and whose servers are the backstage for targeting and budgeting. Whenever a surfer visits a publisher's site, the surfer is referred to one of the servers of the commissioner, which picks an advertisement, and emb eds it in the publisher's site on the surfer's Browser. If the surfer clicks the advertisement on the publisher's site, the surfer is referred to the commissioner, who logs the click for accounting purp oses, and clicks-through the surfer to the advertiser's site. Since publishers earn revenue on impressions (rendering advertisements), as well as clicks they drive to advertisers, there is an incentive for dishonest publishers to inflate the numb er of impressions and clicks their sites generate [3, 5, 25, 28, 34, 35, 38, 45]. Moreover, dishonest advertisers simulate clicks on the advertisements of their comp etitors to deplete their advertising budgets [32, 39], which limits the exp osure of their comp etitors' advertisements. Fraudulent traffic results in bad reputation for the commissioner, and sometimes in paying forfeitures for advertisers [29, 30]. Hit inflation fraud jeopardizes not only the Internet advertising industry, but rather the entire Internet [32]. Hit inflation has b een a concern to advertising commissioners since their conception [47]1 . Most of the research investigates publishers' fraud, since the discussion can b e generalized to advertisers' fraud. Three main approaches have b een prop osed to detect hit inflation attacks. The first classical fraud detection approach employed a set of tools that judge publishers based on how the advertisements b ehave on their sites, such as how the ratio of impressions to clicks (the Click Through Rate ) for each adThe complementary problem of advertisers' hit shaving, where advertisers do not pay commission on some of the received traffic, was satisfactorily solved in [40]. 1 ABSTRACT Click fraud is jeopardizing the industry of Internet advertising. Internet advertising is crucial for the thriving of the entire Internet, since it allows producers to advertise their products, and hence contributes to the well b eing of ecommerce. Moreover, advertising supp orts the intellectual value of the Internet by covering the running exp enses of publishing content. Some content publishers are dishonest, and use automation to generate traffic to defraud the advertisers. Similarly, some advertisers automate clicks on the advertisements of their comp etitors to deplete their comp etitors' advertising budgets. This pap er describ es the advertising network model, and focuses on the most sophisticated typ e of fraud, which involves coalitions among fraudsters. We build on several published theoretical results to devise the Similarity-Seeker algorithm that discovers coalitions made by pairs of fraudsters. We then generalize the solution to coalitions of arbitrary sizes. Before deploying our system on a real network, we conducted comprehensive exp eriments on data samples for proof of concept. The results were very accurate. We detected several coalitions, formed using various techniques, and spanning numerous sites. This reveals the generality of our model and approach. Categories and Subject Descriptors H.2.8 [Database Management]: Database Applications-- Data Mining ; K.4.4 [Computers and Society]: Electronic Commerce--Payment schemes; Security General Terms Algorithms, Exp erimentation, Performance, Security Keywords Click Spam Detection, Coalition Fraud Attacks, Approximate Set Similarity, Similarity-Sensitive Sampling, Cliques Enumeration, Real Data Exp eriments This work was supp orted in part by NSF under grants I IS 02-23022, and CNF 04-23336. Part of this work was done while the first author was at FastClick, Inc., a ValueClick company. Copyright is held by the International World Wide Web Conference Committee (IW3C2). Distribution of these papers is limited to classroom use, and personal use by others. WWW 2007, May 8­12, 2007, Banff, Alberta, Canada. ACM 978-1-59593-654-7/07/0005. 241 WWW 2007 / Track: E*-Applications vertisement differs from the network-wide norms that are supp osedly exclusively known for commissioners [28]. However, fraudsters can make use of a sp ecific publishers' site architecture [36], sample the network-wide metrics of advertisements, acquire knowledge ab out the metrics of the advertisements loaded on their site, and can hence fool the classical tools. In addition, the classical approach cannot detect malicious intentions, and is designed to discard low-quality traffic2 , even if it is legitimate. Aggressively discounting low-quality traffic "underpays" b oth honest publishers and commissioners for a lot of the traffic their servers deliver. The second cryptographic approach [5, 38] asks for the coop eration of surfers to identify fraudulent traffic. This entails changing the advertising network model to b e nontransparent to all surfer, which is unscalable. Moreover, for the solution to b e effective, the commissioner has to uniquely identify surfers, which compromises surfers' privacy. We prop osed the third data analysis approach in [34]. Data analysis tools p erform statistical analysis on aggregate data of surfers' temp orary identification, e.g. Cookie IDs and IP addresses. Analysis of cookie IDs and IPs is more privacy friendly than the cryptographic approach. Cookies store no p ersonal information, and they can b e blocked, or p eriodically cleared [33]. An IP is usually assigned to surfers temp orarily, and could b e shared by several surfers. The philosophy of the data analysis approach is not to change the industry model, and to obfuscate individual surfers' identities, but still achieve satisfactory levels of fraud detection. Data analysis techniques [34, 35] can identify sp ecific patterns that characterizes fraudulent traffic [32]. For instance, to uncover a primitive hit inflation attack where a script continuously simulates clicks on advertisements, we prop osed a simple Bloom-based [4] algorithm to detect duplicates in a stream of impressions or clicks [34]. Exp eriments on a real network were revealing. Interestingly, one of the advertisements was clicked 10,781 times by the same cookie ID in one day. Hence, this approach can reveal malicious intentions, and thus complement the classical tools that cannot distinguish low-quality traffic from fraudulent traffic. The classical and the cryptographic approaches did not distinguish b etween attacks made by one publisher or a group of publishers forming a coalition. Meanwhile, making this distinction is at the heart of the data analysis approach. By forming coalitions, fraudsters share large p ools of attacking machines. Hence, they avoid the cost and overhead of launching highly distributed attacks individually. In addition, coalition attacks are more difficult to detect since the patterns of fraudulent traffic are shared by numerous publishers. Therefore, coalition attacks can easily circumvent the classical tools. However, the data analysis approach looks for signs of publishers' coalitions. In [35], we devised an algorithm that detects a sp ecific attack [3]. In this pap er, we devise a generalized technique that detects all forms of coalition attacks. We first build on published theoretical results to devise the Similarity-Seeker algorithm that discovers coalitions of size 2. We then extend Similarity-Seeker to discover coalitions of arbitrary sizes. Interestingly, when implemented on a real network, Similarity-Seeker detected numerous coalitions of various sizes launching attacks with a variety of techniques. The rest of the pap er is organized as follows. We start by setting 2 The quality of a click or an impression is an estimate of the probability that it yields a sale. Session: E-Commerce and E-Content the stage for discovering coalition attacks in Section 2. We describ e two brute force algorithms in Section 3. Section 4 develops one of the brute force algorithms into SimilaritySeeker that discovers coalitions made by pairs of sites. The generalized problem of detecting coalitions of several sites is explored in Section 5. We comment on the findings on a real network in Section 6. We discuss the related work in Section 7, and conclude with our future work in Section 8. 2. FRAUD DETECTION BY TRAFFIC ANALYSIS The basic premise of traffic analysis is to draw correlations b etween the attacking machines, identified by IPs and cookies, and fraudsters. To circumvent the data analysis techniques, fraudulent publishers should dilute the strong correlation b etween their sites and the machines from which the attacks are launched. This can b e done either on the side of the attacking machines or on the side of the attackers' sites. Hence, the sp ectrum extremes of the hit inflation attacks are: · To dilute the correlation on the machine(s)' side, noncoalition attack are p erformed by one fraudster, while obliterating or frequently changing the identification of the machine(s), or using a huge numb er of machines. · To dilute the correlation on the sites' side, coalition attack are p erformed by many fraudsters, where any machine can b e used to simulate traffic to any site. If we detect b oth non-coalition and coalition attacks, then any attacker on the sp ectrum is also detectable. Hence, the attackers will have no leeway around the fraud detection system, and the hit inflation problem is ultimately solved. Understandably, the difficulty of detecting a non-coalition attack increases as the numb er of machines from which the attack is launched increases. In its simplest form, launching an attack from one machine, identified by one cookie-ID, can b e detected trivially by checking for duplicate impressions and clicks [34]. However, launching an attack from multiple machines is much harder to detect, since the detection algorithm has to examine the relationship b etween each publisher and all the machines generating traffic. Although the straightforward use of network anonymization, e.g. tor.eff.org, is attractive for inexp erienced fraudsters, it is not effective. Those services were designed to protect surfers' privacy. Hence, they block surfers' cookies. Therefore, using network anonymization can b e trivially detected by monitoring the p ercentage of cookie-less traffic p er publisher and investigating publishers whose traffic deviates from the norm. Similarly, on real networks, we notice some novice fraudsters generating a lot of traffic from ISPs that assign virtual IP addresses to surfers, such as AOL r . However, the ranges of IPs of those ISPs are well known, and again, the ratio of the traffic of any publisher received from those ISPs is highly stable across all the publishers. Hence, such attacks are easily detected by examining the ratio of the traffic received from ISPs assigning virtual IPs as compared to the entire publisher's traffic. This is where the complexity of launching scalable attacks b ecomes clear. Hard-to-detect attacks from several machines could b e costly and unscalable to launch. To have a normal p ercentage of cookie-less traffic, and a normal p ercentage of non-virtual ISPs' IPs, fraudsters are motivated 242 WWW 2007 / Track: E*-Applications Session: E-Commerce and E-Content not exclusively go to one publisher (Figure 1). Rather, each attacking machine contributes a small p ortion to the traffic of each fraudster in the coalition. This hinders noncoalition radars from associating any publisher with a reasonable amount of resources. Therefore, forming coalitions is the p erfect solution for fraudsters to launch scalable attacks from machines whose identifications are weakly correlated with their sites. Resources of Fraudster 1 Resources of Fraudster 2 ...... Resources of Fraudster Q Site of Fraudster 1 Site of Fraudster 2 ...... Site of Fraudster Q 2.2 Detecting Fraudsters' Coalitions (a) In non-coalition attacks, every fraudster generates traffic to its site only. Resources of Fraudster 1 Resources of Fraudster 2 ...... Resources of Fraudster Q Site of Fraudster 1 Site of Fraudster 2 ...... Site of Fraudster Q (b) In coalition attacks, every fraudster generates traffic to its and other sites in the coalition. Figure 1: Non-Coalition versus Coalition Attacks to either own the attacking machines, or to control the machines of real surfers through Tro jans in order to use the IPs and cookies of real surfers. In other words, launching scalable attacks entails high cost or requires some Tro janwriting skills, since out-of-the-b ox Tro jans are easily detected with anti-virus softwares, and hence are unscalable. Thus, the correlation b etween the difficulty of launching an attack and the difficulty of detecting it is understandable. To detect coalition attacks, the commissioner has to search for publishers' sites with highly similar traffic. A reliable fingerprint of a site's traffic is the set of IP addresses generating the traffic3 . Given the set of distinct IPs visiting each site, the commissioner has to search for sites whose traffic entries are generated from roughly the same set of IPs4 . Since the traffic entries of different publishers are interleaved in the log file, the traffic is scanned first to obtain the set of IPs visiting each site, which is then stored in a separate file. For each publisher, A, let SA b e the set of IPs visiting A. Define SB analogously. Several measures of the similarity b etween SA and SB exist [16], including the Dice | SA coefficient, 2 |SA |+SB || ; the cosine coefficient, |SA SB | ; | SB |SA |×|SB | and the Jaccard coefficient, |SA SB | . The goal is to discover | SA S B | all pairs of sites whose similarity exceeds some threshold, s. Fortunately, as shown in Section 6, any two legitimate sites have negligible similarity. 3. THE BRUTE FORCE ALGORITHMS We start by describing the two brute force alternatives in Sections 3.1, and 3.2. Since efficiently detecting coalitions is hard, we develop the second algorithms further to detect coalitions of pairs of sites in Section 4. In Section 5, we extend the algorithm to detect coalitions of arbitrary sizes. 2.1 Forming Fraudsters' Coalitions To avoid the cost of launching scalable hit inflation attacks, fraudsters shift their strategy from launching distributed non-coalition attacks, to launching coalition attacks where many fraudsters share their resources (machines used in the attack). That is, the machines are used to generate traffic for several sites. For instance, assume attacker A can generate u hits from each machine it controls, without b eing rejected by the commissioners' non-coalition radars. Another attacker, B , can simulate another u undetectable hits p er controlled resource. Assuming no overlap b etween A's and B 's resources, it is more scalable and cost effective to time-share the resources, and generate 2u hits for each publisher, instead of doubling the numb er of resources. The argument can b e extended for larger coalitions as illustrated in Figure 1. Forming coalitions serves two main purp oses. The first purp ose is to increase the traffic and the revenue while maintaining the same cost p er fraudster. The second purp ose is to blur the relationship b etween the identities (IPs and cookie IDs) of the attacking machines and the attackers' sites. Launching coalition attacks does not need sp ecialized Web development skills. It only needs minimal resources, and the knowledge of other fraudsters. For commissioners, detecting coalition attacks is challenging since the traffic coming from any attacking machine does 3.1 The First Brute Force Algorithm: All-Pairs The first algorithm considers every p ossible pair of sites. For every pair, A and B , the Al l-Pairs algorithm calculates any of the three aforementioned similarity coefficients b etween their sets of IPs by determining the size of the intersection (SA SB ) using a sort-merge procedure. Assuming the main memory cannot accommodate the entire traffic, but can accommodate the traffic of any two sites, then the sort-merge procedure can b e done in memory. However, the traffic is read from disk O (D) times, where D is the numb er of publishers' sites. An average-sized commissioner has around 50,000 sites. The traffic of each site can b e presorted. The presorting step is O (D|S | log (|S |)), where S is the largest IP set. The numb er of distinct IPs visiting a site in a day is around 25,000, but reaches 1,000,000 for some sites. Presorting reduces the total in-memory complexity from O (D2 |S |2 ) to O (D|S |(D + log(|S |))). 3 Through the sequel, we concentrate on fraud detection, and thus we assume the source IPs of the traffic are not sp oofed. Counteracting sp oofing has b een studied in [15]. 4 To simplify the presentation, we assume all the IPs are equally treated. However, the entire discussion can b e trivially generalized for the case where sp ecial IPs, such as those coming from a sp ecific location or known to b elong to Internet Service Providers (ISPs), are given different weights. 243 WWW 2007 / Track: E*-Applications Session: E-Commerce and E-Content cients, resp ectively. Three hash-based sampling techniques were prop osed by Broder et al. to conserve the similarity. Since our final algorithm is based on the Al l-IPs brute force algorithm, we discuss the three techniques in Section 4. Optimizing All-Pairs. The Al l-Pairs algorithm can b e enhanced by using the triangle inequality. This limits the optimization to the Jaccard coefficient since 1 - |SA SB | , the Jaccard dissimilarity of SA | SA S B | and SB , satisfies the triangle inequality. That is, for any + 1 three sites A, B , and C , it is true that - | SA S C | | SA S C | 1 1 . - | SB S C | - | SA S B | The dissimilarity based | SB S C | | SA S B | on the other two coefficients does not satisfy the triangle inequality. The optimization entails looking for a third site, C , such that A and B can b e judged similar or dissimilar using the similarity already calculated for the pair A and C , and the pair B and C in a dynamic programming scheme. Thus, when testing A and B for having similarity exceeding the threshold s,+f there exists a s,ite, C , such that i1 1 - |SB SC | then (1 - s) (1 - s ) - | SA S C | | SB S C | 1 , | SA S C | | SA S B | - |SA SB | and A and B are similar. C tr 1 onversely, if -he1e exists a site, C , such that (1 - s) , - | SA S C | - | SB S C | then A and B are dis| SA S C | | SB S C | . 1 similar b ecause (1 - s) - | SA S B | | SA S B | However, as shown in Section 6, legitimate sites have slim similarity, which makes this optimization ineffective for faster discovery of sites with similar traffic. 4. DETECTING COALITIONS OF PAIRS OF SITES In this section, we develop Al l-IPs to efficiently detect all coalitions of size 2. In Section 4.1, we describ e the sampling of IPs visiting sites, and derive the sample size that guarantees a sp ecific error. In Section 4.2, we prop ose SimilaritySeeker for detecting sites with similar traffic. 4.1 Similarity-Sensitive Sampling The IP sets should b e sampled, such that any pair of sites, A and B , whose IP sets, SA and SB , have a similarity exceeding a threshold, s, is discovered with high probability. For any set S defined on domain N , let : b e a p ermutation of chosen uniformly at random; g : N b e an arbitrary injection; and M ODy (S ) b e the subset of the elements that are 0 modulo y . Yy ,g, (S ), defined as M ODy (g ( (S ))), is a similarity-sensitive sample of S . In [9], |Yy,g, (SA )Yy,g, (SB )| was shown to b e an unbiased estima|Yy,g, (SA )Yy,g, (SB )| 3.2 The Second Brute Force Algorithm: All-IPs Instead of p erforming a sort-merge for every p ossible pair of sets of IPs, the Al l-IPs algorithm p erforms a sort-merge for all the sets together. Hence, all sites sharing a sp ecific IP are identified, and Al l-IPs calculates the similarity of pairs that have a common IP. Counting the numb er of IPs shared by any pair of sites requires one scan on the sorted data. However, the sort-merge is done out-of-memory5 . Hence, the traffic is read from disk O (log(D)) times. The first presorting scan has in-memory complexity of O (D|S | log(|S |)). Then, O (log(D)) scans are made for merging, each has an inmemory complexity of O (D|S |), yielding a total in-memory complexity of O (D|S |(log(D) + log(|S |))). The Al l-IPs brute force algorithm has b een prop osed b efore for measuring set similarity in the context of finding similar files and domains on the Internet [9, 10, 13]. In a collection of D documents, each document is represented by the set of all statements of a sp ecific length the document contains. This is analogous to our problem of finding highly similar sets of IPs among a large collection of sets. tor of the Jaccard coefficient, |SA SB | . Although [9, 10, 13] | SA S B | tackled the problem in the context of Jaccard similarity only, we like to p oint out the generality of the sampling method. |Y , , (SA )Yy, , (SB )| It is easy to show 2 |Yyygg(SA )|+|Yygg (SB )| is an unbiased es,, ,, |Y (S ) Y (S )| timator of the Dice coefficient, and y,g, A y,g, B |Yy,g, (SA )|×|Yy,g, (SB )| Optimizing All-IPs. To compact the sets, Broder et al. [9, 10, 13] prop osed sampling. Reducing the size of S decreases the computations. Sampling results in the sort-merge producing fewer IPs as well as fewer sites sharing any IP. However, random sampling from the two sets SA and SB greatly skews the size of the intersection. In the worst case, random sampling can lead to an empty intersection even though the actual intersection is not empty [17]. Therefore, the sampling technique used should scale down ||SA SB | by SA | × |SB |, the same factor it scales down |SA | + |SB |, or |SA SB |, to conserve the Dice, cosine, or Jaccard coeffi5 [46] gives a good survey on external memory algorithms. is an unbiased estimator of the cosine coefficient Broder et al. prop osed other techniques to estimate Jaccard similarity in sp ecific. For any set, S , from a totally ordered domain, let M I Nz (S ) b e the subset containing the smallest z elements in S if |S | z ; otherwise M I Nz (S ) = S . Let SA and SB b e the sets of IPs visiting sites A and B , resp ectively. Define Zz , (SA ) as M I Nz ( (SA )), and define Zz , (SB ) analogously. It was shown in [13] that an unbiased estimator of the Jaccard coefficient of SA and SB is |M I Nz (Zz , (SA )Zz , (SB ))Zz , (SA )Zz , (SB )| . given by |M I Nz (Zz , (SA )Zz , (SB ))| The numb er of samples generated by the former method, based on Yy ,g, (S ), grows with |S |, which could b e inconvenient; while the numb er of sample generated by the latter method, based Zz , (S ), is fixed. On the other hand, the former method is easier to calculate, and can b e used for all similarity coefficients. However, it is difficult to derive any guarantees on the quality of b oth estimators. In [10], a MinHash-based Jaccard estimator was adopted. Since this is the only estimator that we can draw error guarantees for, we use it in the sequel. Let i b e a p ermutation of chosen uniformly at random. For a set S = {s1 , s2 , . . . } - on , let mini (S ) b e i 1 (min(i (s1 ), i (s2 ), . . . )). Therefore, for sets SA and SB , mini (SA ) = mini (SB ) if and only if mini (SA SB ) (SA SB ). Since i is chosen uniformly at random, then all the elements in (SA SB ) are equiprobable to b ecome mini (SA SB ). Hence, mini (SA ) = mini (SB ) with probability |SA SB | , which is the Jaccard | SA S B | coefficient of SA and SB . Therefore, we can calculate an unbiased estimate of the Jaccard coefficient of the pair SA and SB as follows. Construct a set of n indep endent uniform p ermutations, 1 , 2 , . . . , n , and calculate mini (SA ) and mini (SB ) for 1 i n. The unbiased estimate is 244 WWW 2007 / Track: E*-Applications i i A B given by , which is the ratio of n p ermutations (samples) where b oth minimums agree. Session: E-Commerce and E-Content Pro cedure: Samples-Select (Integer n) b egin //Constructing p ermutations for (Integer i = 1, i n, i + +){ Construct i using the technique in [12]; }// end for //Traffic separation for each TrafficEntry e = (S ite I D, I P ){ Write e to the traffic file of S ite I D on disk; } //Selecting samples for each S ite I D { IP Site-Samples [n] = empty array; for each TrafficEntry e = (S ite I D, I P ){ for (Integer j = 1, j n, j + +){ if (j (I P ) < j (Site-Samples [j ])){ Site-Samples [j ] = I P ; } }// end for }// end for let S ite I D b e stored at row i in Samples Write Site-Samples to Samples [i] on disk; }// end for end; C ount(i| min (S ) = m i n ( S )) 4.1.1 How Big is n? Error Analysis of the Jaccard Estimator. In advertising networks, having guarantees on the quality of the results is very crucial. The commissioner has to know, with very low error rate, sites whose traffic are highly similar. Discarding sites erroneously reduces the commissioner's revenue, while charging advertisers for fraudulent traffic puts the commissioner at risk. However, no robust error analysis for the estimators in [9, 10, 13] was provided. Since each p ermutation generates one sample, to discuss error analysis, we use the two terms interchangeably. We calculate the minimum numb er of samples, n, for a sp ecific confidence interval estimate on the Jaccard coefficient of SA and SB using the sampling method prop osed in [10]. For each p ermutation i , we model comparing the samples, mini (SA ) and mini (SB ), as an indep endent Bernoulli trial with unknown success probability, p = |SA SB | . | SA S B | A Bernoulli random variable is 1 with probability p, and is 0 with probability (1 - p). Based on n samples, the Maxix mum Likelihood Estimator of p is given by p = n i , where ^ - the variance of p is given by p(1n p) . Since the distributions ^ of the samples are indep endent, and identical, the central limit theorem and the law of large numb ers state, for large n, p is approximately normally distributed. Thus, if K de^ notes the cdf of the standard normal distribution b etween - a ^ < K/2 - and , then b oth Pr K/2 < p-p p(1-p)/n - a ^ nd Pr K < p-p re equal to 1 - . p(1-p)/n -^ ^ terval for |SA SB | is given by p ± K/2 p(1n p) , and a one| SA S B | si^ ed (1 - ) appro. imate confidence interval is given by d x ^ p(1-p) ^ ,1 p - K n Figure 2: The Samples-Select Procedure. at random from F , then for any x , Pr(min( ()) = 1 (x)) = || . That is, all the elements of any domain, , have equal chances to b e the minimum element of the image of under any . However, MWI families are impractical, since the cardinality of any such family is at least e||-o(||) [11]. In our case, is the IP domain. To achieve easy computation, Indyk prop osed a relaxed version of MWI, -MWI [24], defined as follows. For any domain N , and x (N - ), if : N N is chosen 1 at random from F , then Pr( (x) < min( ())) = |± 1 . |+ PWe rewrite this in a simpler form as, if x , then 1 | r( (x) = min( ())) - || . That is, all the ele| ments of any domain, , have almost equal chances to b e the minimum element of the image of under any . [24] provides a construction of compact -MWI p ermutations. An even more practical variation of MWI is pair-wise indep endent (PWI) p ermutations [11]. F is a family of PWI p ermutations if for any {s1 , s2 , t1 , t2 } , s1 = s2 , t1 = t2 , if is chosen at random from F then Pr(( (s1 ) = t1 ) ( (s2 ) = t2 )) = ||×(1|-1) . Although [11] showed that PWI families | can b e viewed as -MWI families with as large as log , in practice, they have many implementations and are widely used [12]. For instance, the p erformance of linear indep endence, of the form i (x) = ai x + bi modulo c, is acceptable in real life if {1, . . . , c}, , c , c is a prime, ai and bi are chosen at random, and ai = 0 [11, 6]. [12] provided an efficient way to construct a family of -MWI p ermutations using a random invertible Boolean matrix, coupled with a redundant input representation, yielding an of 0.25 in the worst case, and 0.06 in practice. We choose this implementation [12] due to its guaranteed tolerable error. Therefore, a two-sided (1 - ) approxim^ te confidence ina One-sided confidence intervals are more interesting for our application, since we are looking for pairs of sites with Jaccard coefficient exceeding a threshold s. Therefore, the minimum size, n, of the p ermutations family that would guarantee (1 - ) confidence that the estimator exceeds s ^ K 2. is n = p(1 - p) p-s ^ ^ Clearly, the sample size is not b ounded in the general case. However, if p is to b e estimated within a margin of error of , then this b ounds n by K [ 7]. Statistically, this is interpreted as: with prob2 2 ability at least 1 - , the estimator is no more than less than the user thre1 hold, s, i= ., Pr (p > s - ) 1 - . For s .e ^ 2 .645 instance, n = 423 p ermutations guarantees 2×0.04 that a pair of sites is estimated to b e similar is truly similar within an error of = 0.04 with probability 0.95. 4.2 The Similarity-Seeker Algorithm 4.1.2 A Discussion of Permutations. Random p ermutations are needed to implement this Jaccard estimator. Representing each truly random p ermutation requires |N | log |N | bits, where practically |N | = 264 [24]. This motivated Broder et al. to define min-wise indep endent (MWI) p ermutations in [11]. F is a family of MWI p ermutations on a domain N , if when is chosen Now, we have a reliable sampling technique that efficiently estimates the similarity of huge sets of IPs visiting sites. Moreover, for a given error b ound and confidence, we know the numb er samples to collect. We use this sampling method to describ e the Similarity-Seeker algorithm. The algorithm starts by collecting samples using the technique in [12]. It uses every p ermutation to discover all p os- 245 WWW 2007 / Track: E*-Applications Algorithm: Similarity-Seeker (Double s, , , Integer l) b egin //Calculating thK numb;er of samples e Integer n = 2 2 Session: E-Commerce and E-Content is probably the IP address of an Internet Service Provider (ISP) or a Network Address Translation (NAT) b ox, that is shared by hundreds of computers. Otherwise, for each pair in a list sharing a sample, a corresp onding element is created with a numb er of shared samples of 1, and is inserted in C , if it does not already b elong to C , or is incremented otherwise. At any time, if the incremented pair satisfies the similarity s, Similarity-Seeker outputs this pair as similar. The algorithm for discovering all pairs with a similarity exceeding s with confidence and error is presented in Figure 3. //Creating an array of samples IP Samples [|S ites|][n] = empty D × n array; Samples-Select (n); //Allo cating space for C Set C = empty set; for (Integer j = 1, j n, j + +){// IPs lo op //Allo cating space for H HashTable H = empty hash table; for (Integer i = 1, i |S ites|, i + +){// Sites Lo op //Populating H with lists of sites sharing an IP let S ite I D b e the SiteID at Samples [i] let I P b e the IP at Samples [i, j ] Insert S ite I D into H [I P ].SiteList; }// end for //Incrementing similarity of sites sharing the j th IP for each SiteList S L in H { if (|S L| < l){// Sites do not share a p opular IP for each two sites, A and B , in S L{ if ((A, B ) C ){ C [(A, B )].Integer ++; if (sn - 1 C [(A, B )].Integer > sn){ Output (A, B ) as similar; } } else{ Insert (A, B , 1) into C ; } }// end for } }// end for }// end for end; 4.2.3 The Similarity-Seeker Complexity. In addition to the two scans on the traffic date made by Samples-Select, Similarity-Seeker makes only one scan on Samples. Since Samples is typically smaller than the entire traffic, then the traffic is considered to b e scanned only thrice, as compared to the O (log(D)) in the Al l-IPs algorithm, where D is around 50,000. Calculating the inmemory complexity of Similarity-Seeker is more involved. To establish a b ound on the in-memory complexity of processing one column by Similarity-Seeker, we have to consider two extreme cases. The first is where l - 1 sites share the same sample, and all the other D -(l-1) samples are shared 2 by two sites. The second is where all the D samples are clustered to form lists of length l - 1. Then the complexm l . ity of processing one column is O ax 2 + D, lD The complexity ofl any nonaextreme case is clearly a linear combination of O 2 + D nd O (lD). Since the lD factor dominates, and there are n colwmns, the total complexity of u l ith a tiny hidden constant. In Similarity-Seeker is O D 2 practice, the complexity is less due to the dissimilarity of IPs visiting sites, and hence less clustering of sites. We examine the relationship b etween the estimated sites' similarity and the parameter l in Section 6. Interestingly, some site pairs retain their similarity, no matter how small l is set. Due to the smaller numb er of sets, D, in our case (50,000 instead of all the Internet documents) we assumed that all the D samples from one p ermutation can fit in memory. Thus Similarity-Seeker avoids the out-of-memory sort-merge p erformed by Al l-IPs with all the associated I/O and computational overheads. Figure 3: The Similarity-Seeker Algorithm. sible pairs of sites that share an IP in their samples. The algorithm employs three data structures: Samples, C , and H . Samples is a D × n array whose rows represent sites, columns represent p ermutations, and cells represent samples for sites under p ermutations. The Samples array is p opulated by the Samples-Select procedure describ ed next. C is a set of triplets: two site IDs and the numb er of shared samples. C tracks pairs of sites with shared samples exceeding sn, where s is the similarity threshold. H is a hash table of tuples: a sample ID, and a list of sites sharing it. 4.2.1 Selecting the Samples. To p opulate Samples, the Samples-Select, as sketched in Figure 2, procedure makes two preliminary scans on the traffic data. In the first scan, Samples-Select separates the traffic of each site in an individual file that can fit in memory. For each site, Samples-Select makes a second pass where it hashes each traffic entry using all the p ermutations. After collecting all the samples, the file is written back to the right row in Samples. We only require the memory to accommodate one row or one column of Samples at a time. This increases the scalability of the prop osed algorithm. The inmemory complexity of Samples-Select is O (D|S |n). 5. DETECTING COALITIONS OF ARBITRARY SIZES Similarity-Seeker can efficiently detect coalitions of pairs of sites. However, as we discuss in Section 5.1, attackers form coalitions of sizes exceeding 2 to stay under the radar level by giving up some greed. Hence, we extend SimilaritySeeker to detect larger coalitions in Sections 5.2, and 5.3. 5.1 Greed versus Subtleness Given a Coalition Size A group of attackers, of size Q, can make hard-to-detect coalitions by not sharing all the resources together. Instead, each publisher would share each resource it controls with only q random attackers. Hence, as shown by Theorem 1 the pairs' similarities drop, while still gaining from coalitions. Theorem 1. A group of attacking publishers, of size Q, where each publisher shares each resource it controls with only q < Q random publishers, reduces similarity between q( 1) pairs from 1 to 2(Q-1)+q+2Q-q-3) and achieves a gain of q . q( 4.2.2 Discovering Similar Pairs. Once Samples is p opulated, Similarity-Seeker reads it in a column-ma jor manner. For every column (p ermutation), a hash table, H , is temp orarily constructed for holding D samples and their corresp onding lists of sites sharing each sample. The hash table is p opulated after one scan on the column, and the list of sites sharing each IP is p opulated. A scan is then made on H . Any list that contains a large numb er, l, of sites sharing a sample is discarded. This sample 246 WWW 2007 / Track: E*-Applications Proof. Assume each site in the attacking group controls r resources, the Jaccard coefficient is used for similarity, and that a resource shared by A with B is not re-shared by B . For any two sites, A and B , SA SB is given by the resources shared by A with B , the resources shared by B with A, and the resources shared by all the other Q - 2 nodes with b oth (q q A and B . This is equal to 2r Qq 1 + r (Q - 2) × (Q-)()(-1) 2) = - 1 Q- q r (q + 1) Q-1 . SA SB is given by the resources controlled by A, the resources controlled by B , and the resources shared by all the other Q - 2 nodes with either A or B . This (q q is equal to 2r + 2r (Q - 2) Qq 1 - r (Q - 2) × (Q-)()(-1) 2) . - 1 Q- q( 1) Hence the Jaccard coefficient is given by 2(Q-1)+q+2Q-q-3) . q( The resources directing traffic to each site is r (q + 1), after forming the coalition, instead of the r resources controlled by each site. 2 Session: E-Commerce and E-Content edges connect pairs of sites with similar traffic. The ob jective is to search for, instead of just edges, all maximal cliques in this huge graph. 5.3 Discovering All Maximal Cliques in the Sites' Similarity Graph Let G = (V , E ) b e the sites' similarity graph, where the set of nodes V represent sites, and E is a set of edges connecting pairs of sites if and only if their similarity is at least s. We will b e interchangeably referring to sites and nodes representing them. For a subset W V , G(W ) = (W, E (W )) with E (W ) = {(v , w) W × W |(v , w) E } is called a subgraph of G induced by W , and G is called a supergraph of G(W ). G(W ) is said to b e a clique if and only if (v , w) E v , w W . G(W ) is a maximal clique if it is not a proper subgraph of another clique. The ob jective is to find all maximal clique ¯ ¯ in the sites' similarity graph G. The graph G = (V , E ) is a ¯ complementary graph of G if E = {(v , w) V × V |(v , w) / E }. A subset W V is an independent set if and only if (v , w) E v , w W . W is a maximal independent set if it / is not a proper subgraph of another independent set. Finding all maximal clique in a graph G is equivalent to finding all ¯ maximal independent sets in G. Two seminal algorithm for enumerating all cliques are provided in [14]. The first algorithm is a basic recursive branch and backtrack technique. The second algorithm is an optimized variation that prunes the search space faster. | [ The complexity of the second algorithm is O V |3 3 43]. Among all the variations [19, 26, 43, 44] of the al3 orithms g b |V | |V | Corollary 1. By increasing the size of a coalition, attackers can sustain similarity between pairs at a low level, while stil l increasing their gains. Proof. To keep pairs' similarity b elow a detectable level, s, the gain of each publisher is q , where q is given by 2(s1 1) × +. ( ( 2Qs + 1)2 - (4Qs2 + 1) + (1 - s)2 2Q - 3)s - 1 + Hence, similarity b etween pairs can b e sustained at any level, s, while increasing the gain, q , by forming larger groups, i.e. increasing Q. 2 From Theorem 1, if q > 1 and q = Q, the Jaccard 1 coefficient is less than Q . Fortunately, as shown in Section 6, any two legitimate sites have negligible similarity. Therefore, even if attackers form large coalitions, the subtle similarity b etween pairs is still ab ove the norm, and is still detectable by Similarity-Seeker. 5.2 Detecting Large Coalitions based on Pairs of Sites Increasing the size of coalitions from pairs to large groups shifts our focus from searching for pairs of sites with highly similar traffic to searching for groups with moderate similarity. For two main reasons, the solution in Section 4 has to b e extended for coalitions of arbitrary sizes. First, outputting one set of several sites, such that each pair of sites have similar traffic, establishes the evidence for the fraudsters' malicious intention. It is very unlikely, for any random group of sites of size Q, that all p ossible pairs are similar. If any two publishers' sites can b e mistakenly judged to have similar traffic with probability . Then, mistakenly judging that Q random sites are involved in a coalition attack has Q(Q-1) a probability of 2 . For instance, if = 0.1, then erroneously judging a small group of 5 random sites to b e in coalition has a probability of 10-10 . Second, outputting one set of several sites is more concise than outputting all the p ossible pairs in that set. For instance, if 3 coalitions of size 50 fraudsters are discovered, it is more convenient for the management to verify a list of 3 entries, each of size 50, than to examine 3 × 50×49 = 3225 2 entries, each of size 2. In addition, the output conciseness gives a more panoramic picture of the coalition and facilitates manual investigations like checking for common features of the sites, such as contract date, and earning rate. Therefore, we need to condense the detected pairs of sites into groups of sites. sites' similarity can b e modeled as an undirected graph whose nodes represent sites, and whose 3 y in [14], [43] was able to reach a complexity of O integrating the output function into the algorithm. This is |V | optimal, since a graph can contain up to 3 3 cliques 6 [37]. The maximal cliques enumeration algorithm in [43] has p olynomial storage requirements, and is optimal in the worst case. However, since the sites' similarity graph is extremely sparse as shown in Section 6, quantifying the complexity in terms of the numb er of maximal cliques in the graph is crucial to our application. Although this involves complex analysis [43], the original exp eriments in [14] showed the average time to identify a maximal clique, i.e., a coalition attack, is not dep endent on the size of the graph or the numb er of maximal cliques. We recommend the implementation of [43] due to its optimal worst case complexity, though other algorithms established complexity b ounds in terms of the numb er of maximal cliques in the graph, i.e., the numb er of coalitions. For instance, [44] combined the pruning techniques in [2] and [14] to find all the maximal indep endent sets with a complexity of O (|V ||E |µ), where |V |, |E |, and µ are the numb ers of nodes (sites), edges, and cliques in the graph. This algorithm was | improved in [19] to O (a(G)|E |µ), where a(G) O( V |). 6. FINDINGS ON A REAL NETWORK We have devised Similarity-Seeker for detecting coalitions of pairs of attackers, and then extended it for coalitions of arbitrary sizes. To check the validity and the effectiveness of our development, we comment on our comprehensive set For instance, let G b e a graph of x triplets, i.e., |V | = 3x. Two nodes are connected if they b elong to different triplet. Hence, there are 3x cliques, each of size 3. 6 247 WWW 2007 / Track: E*-Applications Session: E-Commerce and E-Content Variation of Number of Site Pairs with Similarity Excluding IPs shared by l or more sites l >= Number of Pairs 1.E+8 1.E+7 1.E+6 1.E+5 1.E+4 1.E+3 1.E+2 1.E+1 1.E+0 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% 1000 sites 30 sites 100 sites 20 sites 50 sites 10 sites 40 sites Similarity Figure 4: The Number of Pairs (Logarithmic Scale) Having a Specific Similarity. of exp eriments using real data. We describ e our exp erience with building a fraud detection system at Fastclick, Inc., a ValueClick company. For proof of concept, we analyzed a data sample of 54,045,873 traffic entries using SimilaritySeeker to discover site pairs with similar traffic. To reduce the noise, we excluded all the IPs that visited a large numb er of sites, b ecause such IPs probably b elong to NAT b oxes. We rep eated the exp eriment and progressively increased the parameter l, the numb er of sites b eyond which the IPs are disregarded, from 10 to 1,000. The results are plotted in Figure 4, with a logarithmic scale on the vertical axis. It is interesting to note that when l was 1,000, 98.94% of the pairs had less than 1% of similarity. When l was 10, 99.98% of the pairs had less than 1% of similarity. Clearly, the similarity b etween sites' IP sets is negligible. For each run, Similarity-Seeker output all the pairs with similarity more than 10%, and we fed the output to the cliques enumeration algorithm [43] to discover larger coalitions. In particular, when l, the numb er of sites b eyond which the IPs are disregarded, was 10, Similarity-Seeker output 81 pairs with similarity, s, at least 0.1. The 81 pairs contained 5 coalitions of size 3. As l increased, more pairs were discovered. For instance, when l was 30, 189 pairs of sites were found. Since more IPs shared by the disjoint comp onents were considered, the cliques enumeration algorithm was able to connect some of these comp onents into bigger coalitions. However, many of the cliques were overlapping. As l increased to 40, 647 pairs were found. It b ecame clear that the cliques discovered are highly overlapping and noisy, since p opular IPs that are shared by many legitimate sites are not discarded. To reduce the noise, we increased the minimum traffic similarity gradually b eyond 0.1. When s reached 0.5, for the same l of 40, the output of SimilaritySeeker comprised 406 pairs of sites that translated into exactly 1 p erfect clique of size 29. That was clearly a coalition attack, as discussed b elow. As l, the numb er of sites b eyond which the IPs are disregarded, increases, more legitimate pairs are output as similar, since p opular IPs shared by those sites are not discarded. To overcome this problem, it is advisable to increase the minimum required traffic similarity, s, as we increase l. We increased l to 50 sites and increased s gradually up to 0.6. There were 680 pairs. With very few anomalies that can b e manually identified, there was the original coalition of size 29, sharing 15 sites with another coalition of size 22. There were 8 other disjoint coalitions of sizes b etween 3 and 10, some isolated pairs, and a star of size 6. Among all the discovered 680 sites, there were strong evidences that more than 93% of them were real fraudsters. Most of the coalitions had all their sites signed up with the commissioner around the same date. The noticeable characteristic of the coalitions of size 29 and 22 was that their traffic was of moderate size, yet coming from IPs all over the world. After further investigations, we found that the Referer fields in the HTTP requests were coming from pages that do not have the commissioner's advertisements; and sometimes no advertisements at all. We susp ect the attackers had some form of readily available traffic through a network of Tro jans, and that they do not work for the domains they signed up for. When the commissioner sent the account activation e-mails7 , the publishers somehow acquired the attached activation secrets and activated the accounts. Since the activation secrets are stored in a hashed form, the attackers must have compromised some machines on those domains, and hence, acquired the attached secrets. For some of the isolated pairs, we noticed almost simultaneous traffic entries for the two sites of the coalition. We susp ect that those attacks were launched using the attack in [3] that will b e discussed shortly in Section 7. As l grew further, the cliques enumeration algorithm did not connect those coalitions, but rather started to output groups that share extremely p opular IPs. We checked those IPs on www.arin.net/whois, and they are ISP-owned IPs. We recommend setting l and s initially to small values, say 5 sites and 0.1 similarity. From there, the commissioner can tune the values of l and s according to the noise in the results, as describ ed ab ove. The noise is usually manifested in the numb er of isolated pairs, small coalitions, and overlapping coalitions. From our exp erience, an appropriate value for the error is 1s , and the plausible range for the 0 confidence is b etween 0.01 to 0.1. 7. RELATED WORK The work related to ours can b e classified into two categories. The first category is the recent research in discovering densely connected subgraphs in huge graphs. The second category is our previous work on coalition attacks. 7.1 Discovering Dense Subgraphs Due to the NP-hardness of the cliques enumeration problem, it was recently attempted to approximate this problem as discovering densely connected subgraphs or clusters. 7 When a publisher signs up with FastClick, FastClick sends the publisher an e-mail on the domain signed up, with a secret key. The publisher can activate the account only using this secret key. This ensures that the publisher has an e-mail account on the domain (s)he signed up for. 248 WWW 2007 / Track: E*-Applications Three main approximations were prop osed. The conductance measure [8, 18, 22, 27] of a subgraph is a measure of the numb er of the external edges (bridges to the rest of the graph), in comparison to the internal edges of the subgraph, and the internal edges of the rest of the graph. The second cluster editing approximation [41] b ounds the numb er of edges to b e added or deleted to transform a subgraph into an isolated clique. The third approximation [1] b ounds the average internal degree of nodes. However, discovering any cluster with a b ound on any of the three approximations is proved to b e NP-complete [23, 42]. Being as hard as the original problem, the approximations are of limited use. The algorithm in [21] efficiently discovers dense clusters. However, the algorithm associates no connectivity metrics with the identified clusters. It hence provides no guarantees on the clustering quality. The algorithm cannot answer queries ab out clusters with a sp ecific threshold on a connectivity metric, since the identified clusters can b e either split or combined to satisfy the query threshold. Although the algorithm is suitable for link spam detection, it is not applicable in money-sensitive fraud detection. Session: E-Commerce and E-Content attacks in their full generality, including the attack in [3] without the need to the ISPs' supp ort. 8. DISCUSSION AND FUTURE WORK We have prop osed a generalized solution for detecting generalized coalition attacks of hit inflation. Since sites' traffic is highly dissimilar, any similarity is usually suspicious. We modeled the problem of detecting fraud coalitions in terms of the set similarity problem. We built on several published theoretical results to prop ose our Similarity-Seeker algorithm that uncovers coalitions of site pairs. We then extended the detection algorithm to detect coalitions of any size by finding all maximal cliques in a graph. On real network data, 93% of the detected sites were provably fraudsters. This shows how accurate our model and algorithm are regardless of how the attack is designed. However, several publishers can collude to attack more than one commissioner. Each commissioner can only know the traffic of its publishers, and no commissioners can detect the similarity b etween the traffic of the attackers. Then, to detect attacks that span several advertising networks, we anticipate the development of new sp ecialized auditing, or "detective", entities that are trusted by commissioners. Our future work focuses on two directions. The first direction is to compare several cliques enumeration using real data. Although [43] has an optimal worst case b ound, other algorithms can outp erform it for extremely sparse graphs. The algorithms in [19, 20, 31, 43] were never exp erimentally compared, to the b est of our knowledge. The second direction of our future work is to extend the algorithms to the streaming environment. Bearing in mind that an averagesized commissioner has around 50,000 publishers' sites, and receives around 70M traffic entries p er hour, it is desirable to detect coalition attacks using only one scan on the data. 7.2 Previous Work on Coalition Attacks We have previously prop osed an algorithm to detect the coalition attack identified in [3]. The attack in [3] involves a coalition of a dishonest publisher, P , with a dishonest Web site, S . S 's page will have a script that runs on the surfer's machine when its page loads, and automatical ly redirects the surfer to P 's Web site. P will have two versions of its Web page, a non-fraudulent page; and a fraudulent page. The non-fraudulent page is a normal page that displays the advertisement, and the surfer is totally free to click it or not. The fraudulent page has a script that runs on the surfer's machine when it loads, and automatical ly clicks the advertisement. P selectively shows the fraudulent page when the Web site that referred the surfer to P is S . The attack silently converts every innocent visit to S to a click on the advertisement in P 's page. Several factors make the attack virtually imp ossible to detect. First, if the commissioner directly visits P 's page, the non-fraudulent page will b e loaded. Second, the commissioner cannot know the Referer fields of the HTTP requests to publishers. To identify S , the commissioner has to check all the Internet sites, which is infeasible. Third, the attack is done in an automatic way that is hidden from the surfer. In [35], we prop osed a solution this sophisticated coalition attack via a collab oration b etween commissioners and ISPs. By analyzing the aggregate stream of HTTP requests, the ISP can detect sites that are usually visited b efore a sp ecific site, without violating the surfers' privacy. Bearing in mind the size and the sp eed of HTTP requests made to ISPs, the problem b oils down to identifying associations b etween HTTP requests that are not widely separated in a traffic stream. We devised the Streaming-Rules algorithm to detect associations among stream elements. Although the solution prop osed in [35] is effective, it is very sp ecific to the attack in [3]. The solution is not effective against other coalition attacks. For instance, if each attacker in the coalition controls a network of surfers' machines through Tro jans, then HTTP requests for attackers could b e widely separated in the ISP HTTP stream, and hence, are not detected by the solution in [35]. However, the general solution prop osed in this pap er detects coalition Acknowledgment We thank Dr. Jerry Qi Zheng for helping us with acquiring the real data, and for his useful discussions. 9. REFERENCES [1] J. Abello, M. Resende, and S. Sudarsky. Massive Quasi-Clique Detection. In Proceedings of the 5th LATIN Latin American Symposium on Theoretical Informatics, pages 598­612, 2002. [2] E. Akkoyunlu. The Enumeration of Maximal Cliques of Large Graphs. SIAM Journal on Computing, 2(1):1­6, 1973. [3] V. Anupam, A. Mayer, K. Nissim, B. Pinkas, and M. Reiter. On the Security of Pay-Per-Click and Other Web Advertising Schemes. In Proceedings of the 8th WWW International Conference on World Wide Web, pages 1091­1100, 1999. [4] Burton H. Bloom. Space/Time Trade-offs in Hash Coding with Allowable Errors. Communications of the ACM, 13(7):422­426, 1970. [5] C. Blundo and S. Cimato. SAWM: A Tool for Secure and Authenticated Web Metering. In Proceedings of the 14th ACM SEKE International Conference on Software Engineering and Know ledge Engineering, pages 641­648, 2002. [6] T. Bohman, C. Cooper, and A. Frieze. Min-Wise Independent Linear Permutations. Electronic Journal of Combinatorics, 7:R26, 2000. [7] A. Bowker and G. Lieberman. Engineering Statistics, 2nd Edition. Prentice Hall, 1972. 249 WWW 2007 / Track: E*-Applications [8] U. Brandes, M. Gaertler, and D. Wagner. Experiments on Graph Clustering Algorithms. In Proceedings of the 11th ESA European Symposium on Algorithms, pages 568­579, 2003. [9] A. Broder. On the Resemblance and Containment of Documents. In Proceedings of the IEEE SEQUENCES Compression and Complexity of Sequences, pages 21­29, 1997. [10] A. Broder. Identifying and Filtering Near-Duplicate Documents. In Proceedings of the 11th COM Symposium on Combinatorial Pattern Matching, pages 1­10, 2000. [11] A. Broder, M. Charikar, A. Frieze, and M. Mitzenmacher. Min-Wise Independent Permutations (Extended Abstract). In Proceedings of the 30th ACM STOC Symposium on Theory Of Computing, pages 327­336, 1998. [12] A. Broder and U. Feige. Min-Wise versus Linear Independence (Extended Abstract). In Proceedings of the 11th ACM-SIAM SODA Symposium on Discrete Algorithms, pages 147­154, 2000. [13] A. Broder, S. Glassman, M. Manasse, and G. Zweig. Syntactic clustering of the Web. In Proceedings of the 6th WWW International Conference on World Wide Web, pages 391­404, 1997. [14] C. Bron and J. Kerbosch. Algorithm 457: Finding All Cliques of an Undirected Graph. Communications of the ACM, 16(9):575­577, 1973. [15] CERT Coordination Center. CERT Advisory CA-1996-21 TCP SYN Flooding and IP Spoofing Attacks. http://www.cert.org/advisories/CA-1996-21.html, September 19 1996. [16] Moses S. Charikar. Similarity estimation techniques from rounding algorithms. In Proceedings of the 34th ACM STOC Symposium on Theory Of Computing, pages 380­388, 2002. [17] S. Chaudhuri, R. Motwani, and V. Narasayya. On Random Sampling over Joins. In Proceedings of the 18th ACM SIGMOD International Conference on Management of Data, pages 263­274, 1999. [18] D. Cheng, S. Vempala, R. Kannan, and G. Wang. A Divide-and-Merge Methodology for Clustering. In Proceedings of the 24th ACM PODS Symposium on Principles of Database Systems, pages 196­205, 2005. [19] N. Chiba and T. Nishizeki. Arboricity and subgraph listing algorithms. SIAM Journal on Computing, 14(1):210­223, 1985. [20] L. Gerhards and W. Lindenberg. Clique Detection for Nondirected Graphs: Two New Algorithms. Computing, Volume 21(4):295­322, 1979. [21] D. Gibson, R. Kumar, and A. Tomkins. Discovering Large Dense Subgraphs in Massive Graphs. In Proceedings of the 31st VLDB International Conference on Very Large Data Bases, pages 721­732, 2005. [22] C. Gkantsidis, M. Mihail, and A. Saberi. Conductance and Congestion in Power Law Graphs. In Proceedings of the 22nd ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, pages 148­159, 2003. [23] K. Holzapfel, S. Kosub, M. Maaß, and H. T¨ubig. The a Complexity of Detecting Fixed-Density Clusters. In Proceedings of the 5th CIAC Italian Conference on Algorithms and Complexity, pages 201­212, 2003. [24] P. Indyk. A small Approximately Min-Wise Independent Family of Hash Functions. In Proceedings of the 10th ACM-SIAM SODA Symposium On Discrete Algorithms, pages 454­456, 1999. [25] M. Jakobsson, P. MacKenzie, and J. Stern. Secure and Lightweight Advertising on the Web. In Proceedings of the 8th WWW International Conference on World Wide Web, pages 1101­1109, 1999. [26] H. Johnston. Cliques of a Graph-Variations on the Bron-Kerbosch Algorithm. International Journal of Session: E-Commerce and E-Content Computer and Information Sciences, 5(3):209­238, 1976. [27] R. Kannan, S. Vempala, and A. Veta. On Clusterings: Good, Bad and Spectral. In Proceedings of the 41st IEEE FOCS Annual Symposium on Foundations of Computer Science, pages 367­377, 2000. [28] D. Klein. Defending Against the Wily Surfer-Web-based Attacks and Defenses. In Proceedings of the 1st USENIX ID Workshop on Intrusion Detection and Network Monitoring, pages 81­92, 1999. [29] M. Liedtke. Google to Pay $90M in `Click Fraud' Case. Washington Post Magazine, March 9 2006. [30] M. Liedtke. Yahoo Settles `Click Fraud' Lawsuit. MSNBC News, June 28 2006. [31] E. Loukakis. A New Backtracking Algorithm for Generating the Family of Maximal Independent Sets of a Graph. Computers & Mathematics with Applications, 9(4):583­589, 1983. [32] C. Mann. How Click Fraud Could Swallow the Internet. Wired Magazine, January 2006. [33] R. McGann. Study: Consumers Delete Cookies at Surprising Rate. ClickZ News, March 14 2005. [34] A. Metwally, D. Agrawal, and A. El Abbadi. Duplicate Detection in Click Streams. In Proceedings of the 14th WWW International World Wide Web Conference, pages 12­21, 2005. [35] A. Metwally, D. Agrawal, and A. El Abbadi. Using Association Rules for Fraud Detection in Web Advertising Networks. In Proceedings of the 31st VLDB International Conference on Very Large Data Bases, pages 169­180, 2005. [36] A. Metwally, D. Agrawal, and A. El Abbadi. Hide and Seek: Detecting Hit Inflation Fraud in Streams of Web Advertising Networks. Technical Report 2006-06, University of California, Santa Barbara, Department of Computer Science, 2006. [37] J. Moon and L. Moser. On cliques in graphs. Israel journal of Mathematics, 3:23­28, 1965. [38] M. Naor and B. Pinkas. Secure and Efficient Metering. In Proceedings EUROCRYPT International Conference on the Theory and Application of Cryptographic Techniques, pages 576­590, 1998. [39] S. Olsen. Click Fraud Roils Search Advertisers. CNET News, March 4 2005. [40] M. Reiter, V. Anupam, and A. Mayer. Detecting Hit-Shaving in Click-Through Payment Schemes. In Proceedings of the 3rd USENIX Workshop on Electronic Commerce, pages 155­166, 1998. [41] R. Shamir, R. Sharan, and D. Tsur. Cluster Graph Modification Problems. Discrete Applied Mathematics, 144(1-2):173­182, 2004. [42] J. Sima and S. Schaeffer. On the NP-Completeness of Some Graph Cluster Measures. In Proceedings of the 32nd SOFSEM Conference on Current Trends in Theory and Practice of Informatics, pages 530­537, 2006. [43] E. Tomita, A. Tanaka, and H. Takahashi. The Worst-Case Time Complexity for Generating All Maximal Cliques. In Proceedings of the 10th COCOON Annual International Conference on Computing and Combinatorics, pages 161­170, 2004. [44] S. Tsukiyama, M. Ide, H. Ariyoshi, and I. Shirakawa. A New Algorithm for Generating All the Maximal Independent Sets. SIAM Journal on Computing, 6(3):505­517, 1977. [45] D. Vise. Clicking To Steal. Washington Post Magazine, page F01, April 17 2005. [46] J. Vitter. External Memory Algorithms and Data Structures: Dealing with Massive Data. ACM Computing Surveys, 33(2):209­271, 2001. [47] T. Zeller Jr. With Each Technology Advance, a Scourge. The New York Times, October 18 2004. 250