SIGIR 2007 Proceedings, Session 17: Spam Spam Spam

Know your Neighbors: Web Spam Detection using the Web Topology

Carlos Castillo¹ (chato@yahoo-inc.com), Debora Donato¹ (debora@yahoo-inc.com), Aristides Gionis¹ (gionis@yahoo-inc.com), Vanessa Murdock¹ (vmurdock@yahoo-inc.com), Fabrizio Silvestri² (f.silvestri@isti.cnr.it)

¹ Yahoo! Research Barcelona, C/Ocata 1, 08003 Barcelona, Catalunya, Spain
² ISTI-CNR, Via G. Moruzzi 1, 56124 Pisa, Italy

ABSTRACT

Web spam can significantly deteriorate the quality of search engine results. Thus there is a large incentive for commercial search engines to detect spam pages efficiently and accurately. In this paper we present a spam detection system that combines link-based and content-based features, and uses the topology of the Web graph by exploiting the link dependencies among the Web pages. We find that linked hosts tend to belong to the same class: either both are spam or both are non-spam. We demonstrate three methods of incorporating the Web graph topology into the predictions obtained by our base classifier: (i) clustering the host graph, and assigning the label of all hosts in the cluster by majority vote, (ii) propagating the predicted labels to neighboring hosts, and (iii) using the predicted labels of neighboring hosts as new features and retraining the classifier. The result is an accurate system for detecting Web spam, tested on a large and public dataset, using algorithms that can be applied in practice to large-scale Web data.

Categories and Subject Descriptors: H.4.m [Information Systems Applications]: Miscellaneous
General Terms: Algorithms, Measurement.
Keywords: Link spam, Content spam, Web spam

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGIR'07, July 23–27, 2007, Amsterdam, The Netherlands. Copyright 2007 ACM 978-1-59593-597-7/07/0007 ...$5.00.

1. INTRODUCTION

Traditional information retrieval algorithms were developed for relatively small and coherent document collections such as newspaper articles or book catalogs in a library. Very little, if any, of the content in such systems could be described as "spam." In comparison to these collections, the Web is massive, changes more rapidly, and is spread over geographically distributed computers [2]. Distinguishing between desirable and undesirable content in such a system presents a significant challenge, as every day more people are using search engines more often.

Search engine spamming, also known as spamdexing, encompasses malicious attempts to influence the outcome of ranking algorithms for the purpose of getting an undeservedly high rank. Obtaining a higher rank is strongly correlated with traffic, and a higher rank often translates to more revenue. Thus, there is an economic incentive for Web site owners to invest in spamming search engines, instead of investing in improving their Web sites. Spamming the Web is cheap and, in many cases, successful.

Web spam is not a new problem, and is not likely to be solved in the near future. According to Henzinger et al. [17], "Spamming has become so prevalent that every commercial search engine has had to take measures to identify and remove spam. Without such measures, the quality of the rankings suffers severely". Web spam damages the reputation of search engines and weakens the trust of their users [16]. For instance, Eiron et al. [12] ranked 100 million pages using PageRank [23] and found that 11 out of the top 20 were pornographic pages, which achieved such high ranking through link manipulation, indicating that the PageRank algorithm is highly susceptible to spam. Spamming techniques are so widely known that there have even been spamming competitions (e.g., the contest to rank highest for the query "nigritude ultramarine" [11], among others).

From the perspective of the search engine, even if the spam pages are not ranked sufficiently high to annoy users, there is a cost to crawling, indexing and storing spam pages. Ideally, search engines would like to avoid spam pages altogether, before they consume resources that could otherwise be used for storing, indexing and ranking legitimate content.

Overview of our approach. We start by building an automatic classifier that combines a set of link-based and content-based features. In general, traditional machine learning methods assume that the data instances are independent. In the case of the Web there are dependencies among pages and hosts. One such dependency is that links are not placed at random and, in general, similar pages tend to be linked together more frequently than dissimilar ones [10]. Such a dependency holds also for spam pages and hosts: spam tends to be clustered on the Web. One explanation for this behavior is that spam pages often adopt link-based rank-boosting techniques such as link farming. These techniques can be as simple as creating a pool of pages linking to a page whose rank is to be raised. In practice spammers use sophisticated structures that are difficult to detect.

We investigate techniques that exploit the connections between spam hosts in order to improve the accuracy of our classifiers. We assume that hosts that are well-linked together are likely to have the same class label (spam or non-spam). More generally, we can assume that two hosts in the same class should be connected by short paths going mostly through hosts in the same class.

Figure 1 shows a visualization of the host graph in the Web spam collection we are using (described in Section 3). An edge between two hosts is shown only if there are at least 100 links between the two hosts. In the figure, black nodes are spam and white nodes are non-spam. The layout of the nodes in the figure was computed using a spring model. For the larger connected component of the graph, we can see that spam nodes tend to be clustered together (in the upper-right corner of the central group of nodes of Figure 1). For the nodes that are not connected to this larger connected component (or are connected by links below the threshold), we can see that most of the groups are either exclusively spam or exclusively non-spam.

Figure 1: Graphical depiction of the host graph (undirected), pruned to include only labeled nodes with a connection of over 100 links between them. Black nodes are spam, white nodes are non-spam. Most of the spammers in the larger connected component are clustered together (upper-right end of the center portion). Most of the other connected components are single-class (either only spam nodes, or only non-spam nodes).

Main contributions:
– To the best of our knowledge, this is the first paper that integrates link and content features for building a system to detect Web spam.
– We investigate the use of a cost-sensitive classifier to exploit the inherent imbalance of labels: in the dataset we use, most of the Web content is not spam.
– We demonstrate improvements in the classification accuracy using dependencies among labels of neighboring hosts in the Web graph. We incorporate these dependencies by means of clustering and random walks.
– We apply stacked graphical learning [8] to improve the classification accuracy, exploiting the link structure among hosts in an efficient and scalable way.

The rest of this paper is organized as follows. Section 2 describes the previous work on Web spam detection. In Section 3 we discuss the dataset and experimental framework. Section 4 describes the features we extract. In Section 5 we present the classification algorithms and propose methods to improve their accuracy. Section 6 shows how to improve the classification accuracy by exploiting the graph topology. Finally, Section 7 presents our conclusions and discusses future work on spam detection.

2. PREVIOUS WORK

Previous work on Web spam detection has focused mostly on the detection of three types of Web spam: link spam, content spam, and cloaking.

Link spam consists of the creation of a link structure, usually a tightly knit community of links, aimed at affecting the outcome of a link-based ranking algorithm. Methods for the detection of link-based spam rely on automatic classifiers (e.g., [4]), propagating trust or distrust through links (e.g., [13]), detecting anomalous behavior of link-based ranking algorithms [30], removing links that look suspicious for some reason (e.g., [9]), or using "bursts" of linking activity as a suspicious signal [25].

Content spam is done by maliciously crafting the content of Web pages [14], for instance, by inserting keywords that are more related to popular query terms than to the actual content of the pages. Methods for detecting this type of spam use classifiers [22] or look at language-model disagreement [21]. To some extent, these techniques overlap with some of the methods used in e-mail spam filtering.

Cloaking consists of sending different content to a search engine than to the regular visitors of a web site (e.g., [27]).
The version of the page that is sent to the search engine usually includes content spam, and can be detected using content-spam detection methods, or by comparing the indexed version of a page to the page that users actually see.

The link connections among spam pages have been used for propagating the "non-spam" label in TrustRank [13], for propagating the "spam" label in BadRank [28], or for propagating both [29, 6]. In contrast, the detection of Web spam presented in this paper is based on smoothing the predictions obtained by a classification system, and not on propagating the labels themselves.

In the machine-learning literature there are many papers on using label-propagation methods (e.g., [19, 20]). Recently, some researchers have applied regularization methods for improving the accuracy of webpage classification tasks [24, 31, 1]. For instance, Angelova and Weikum [1] improve a text classifier by exploiting link dependencies and weighting links according to text similarity. Label-propagation methods have been tested on collections in the order of thousands of nodes, but they cannot be applied directly to Web-scale data, because they need to keep the graph in main memory during the smoothing phase, which is not feasible in practice. We only use smoothing methods that can work with the graph represented as a stream in secondary memory, and can be used with Web datasets of any size. We show that even when we avoid using sophisticated regularization techniques for the sake of scalability, we can still obtain significant improvements in the classification accuracy of the base system.

3. DATASET AND FRAMEWORK

3.1 Data set

We use the WEBSPAM-UK2006 dataset, a publicly available Web spam collection [7]. It is based on a crawl of the .uk domain done in May 2006, including 77.9 million pages and over 3 billion links in about 11,400 hosts. This reference collection is tagged at the host level by a group of volunteers. The assessors labeled hosts as "normal", "borderline" or "spam", and were paired so that each sampled host was labeled by two persons independently. For the ground truth, we used only hosts for which the assessors agreed, plus the hosts in the collection marked as non-spam because they belong to special domains such as police.uk or .gov.uk.

The benefit of labeling hosts instead of individual pages is that a large coverage can be obtained, meaning that the sample includes several types of Web spam, and the useful link information among them. Since about 2,725 hosts were evaluated by at least two assessors, a tagging with the same resources at the level of pages would have been either completely disconnected (if pages were sampled uniformly at random), or would have had a much smaller coverage (if a subset of sites were picked at the beginning for sampling pages). On the other hand, the drawback of host-level tagging is that a few hosts contain a mix of spam and non-spam content, which increases the classification errors. Domain-level tagging would have been another option and should be explored in future work.

For the content analysis, a summary of the content of each host was obtained by taking the first 400 pages reachable by breadth-first search. The summarized sample contains 3.3 million pages. All of the content data used in the rest of this paper were extracted from this summarized version of the crawl. Note that the assessors spent on average 5 minutes per host, so the vast majority of the pages they inspected were contained in the summarized sample.

3.2 Framework

The foundation of our spam detection system is a cost-sensitive decision tree. We found this classification algorithm to work better than all other methods we tried. The features used to learn the tree were derived from a combined approach based on link and content analysis to detect different types of Web spam pages. Most of the features we used were previously presented in [4, 22], but we believe that ours is the first attempt to combine both link-based and content-based features. The features used to build the classifiers are presented in Section 4. After the initial classification, discussed in Section 5, we applied several smoothing techniques, presented in Section 6.

Evaluation. The evaluation of the overall process is based on a set of measures commonly used in Machine Learning and Information Retrieval [3]. Given a classification algorithm C, we consider its confusion matrix:

                          Prediction
                          Non-spam   Spam
  True      Non-spam      a          b
  Label     Spam          c          d

where a represents the number of non-spam examples that were correctly classified, b represents the number of non-spam examples that were falsely classified as spam, c represents the spam examples that were falsely classified as non-spam, and d represents the number of spam examples that were correctly classified. We consider the following measures: true positive rate (or recall), false positive rate, and F-measure. The recall is R = d/(c + d). The false positive rate is defined as b/(b + a). The F-measure is defined as F = 2PR/(P + R), where P is the precision, P = d/(b + d).

For evaluating the classification algorithms, we focus on the F-measure F as it is a standard way of summarizing both precision and recall. We also report the true positive rate and false positive rate as they have a direct interpretation in practice. The true positive rate R is the amount of spam that is detected (and thus deleted or demoted) by the search engine. The false positive rate is the fraction of non-spam objects that are mistakenly considered to be spam by the automatic classifier.

Cross-validation. All the predictions reported in the paper were computed using tenfold cross-validation.
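The evaluation measures above follow directly from the confusion-matrix counts. As a minimal sketch (the function name and example counts are illustrative, not from the paper):

```python
def evaluation_measures(a, b, c, d):
    """Compute recall (true positive rate), false positive rate and
    F-measure from confusion matrix counts:
      a: non-spam correctly classified    b: non-spam classified as spam
      c: spam classified as non-spam      d: spam correctly classified
    """
    recall = d / (c + d)            # true positive rate
    fpr = b / (b + a)               # false positive rate
    precision = d / (b + d)
    f_measure = 2 * precision * recall / (precision + recall)
    return recall, fpr, f_measure

# Example: 1000 non-spam examples (950 correct), 120 spam (90 correct)
r, fpr, f = evaluation_measures(a=950, b=50, c=30, d=90)
```

The averaged confusion matrix from the tenfold cross-validation would be fed to such a function to produce the figures reported in the tables below.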
For each classifier we report the true positive rate, false positive rate and F-measure. A classifier whose prediction we want to estimate is trained 10 times, each time using 9 out of the 10 partitions as training data and computing the confusion matrix using the tenth partition as test data. We then average the resulting ten confusion matrices and estimate the evaluation metrics on the average confusion matrix.

The clustering and propagation algorithms in Sections 6.2 and 6.3 operate on labels assigned by a base classifier. To avoid providing these algorithms with an oracle by passing along the true labels, we do the following: (1) for each unlabeled data instance, we pass to the clustering and propagation algorithms the label predicted by the base classifier; (2) for each labeled data instance, we pass the label predicted by the baseline classifier when the data instance was in the test partition.

4. FEATURES

We extract link-based features from the Web graph and host graph, and content-based features from individual pages. For the link-based features we follow Becchetti et al. [4], and for the content-based features, Ntoulas et al. [22].

4.1 Link-based features

Most of the link-based features are computed for the home page and the page in each host with the maximum PageRank. The remainder of the link features are computed directly over the graph between hosts (obtained by collapsing together pages of the same host).

Degree-related measures. We compute a number of measures related to the in-degree and out-degree of the hosts and their neighbors. In addition, we consider various other measures, such as the edge-reciprocity (the number of links that are reciprocal) and the assortativity (the ratio between the degree of a particular page and the average degree of its neighbors). We obtain a total of 16 degree-related features.

PageRank. PageRank [23] is a well-known link-based ranking algorithm that computes a score for each page. We compute various measures related to the PageRank of a page and the PageRank of its in-link neighbors. We obtain a total of 11 PageRank-based features.

TrustRank. Gyöngyi et al. [15] introduced the idea that if a page has high PageRank, but does not have any relationship with a set of known trusted pages, then it is likely to be a spam page. TrustRank [15] is an algorithm that, starting from a subset of hand-picked trusted nodes and propagating their labels through the Web graph, estimates a TrustRank score for each page. Using TrustRank we can also estimate the spam mass of a page, i.e., the amount of PageRank received from a spammer. The performance of TrustRank depends on the seed set; in our case we used 3,800 nodes chosen at random from the Open Directory Project, excluding those that were labeled as spam. We observed that the relative non-spam mass for the home page of each host (the ratio between the TrustRank score and the PageRank score) is a very effective measure for separating spam from non-spam hosts. However, using this measure alone is not sufficient for building an automatic classifier because it would yield a high number of false positives (around 25%).

Truncated PageRank. Becchetti et al. [5] described Truncated PageRank, a variant of PageRank that diminishes the influence of a page on the PageRank score of its neighbors.

Estimation of supporters. Given two nodes x and y, we say that x is a d-supporter of y if the shortest path from x to y has length d. Let N_d(x) be the set of the d-supporters of page x. An algorithm for estimating the set N_d(x) for all pages x based on probabilistic counting is described in [5]. For each page x, the cardinality of the set N_d(x) is an increasing function with respect to d. A measure of interest is the bottleneck number b_d(x) of page x, which we define to be b_d(x) = min_{j ≤ d} { |N_j(x)| / |N_{j−1}(x)| }. This measure indicates the minimum rate of growth of the neighbors of x up to a certain distance. We expect that spam pages form clusters that are somehow isolated from the rest of the Web graph and have smaller bottleneck numbers than non-spam pages. Figure 2 shows a histogram of b_4(x) for spam and non-spam pages. For most of the non-spam pages, the bottleneck number is around 2.2, while for many of the spam pages it is between 1.3 and 1.7.

Figure 2: Histogram of the minimum ratio change of the number of neighbors from distance i to distance i−1.

4.2 Content-based features

For each web page in the data set we extract a number of features based on the content of the pages. We use most of the features reported by Ntoulas et al. [22], with the addition of new ones such as the entropy (see below), which is meant to capture the compressibility of the page.

Ntoulas et al. [22] use a set of features that measures the precision and recall of the words in a page with respect to the set of the most popular terms in the whole web collection. Motivated by this idea, we add a new set of features that measures the precision and recall of the words in a page with respect to the q most frequent terms from a query log, where q = 100, 200, 500, 1000.

Number of words in the page, number of words in the title, average word length. For these features we count only the words in the visible text of a page, and we consider words consisting only of alphanumeric characters.

Fraction of anchor text. Fraction of the number of words in the anchor text to the total number of words in the page.

Fraction of visible text. Fraction of the number of words in the visible text to the total number of words in the page, including html tags and other invisible text.

Compression rate.
We compress the visible text of the page using bzip. The compression rate is the ratio of the size of the compressed text to the size of the uncompressed text.

Corpus precision and corpus recall. We find the k most frequent words in the dataset, excluding stopwords. We call corpus precision the fraction of words in a page that appear in the set of popular terms. We define corpus recall to be the fraction of popular terms that appear in the page. For both corpus precision and recall we extract 4 features, for k = 100, 200, 500 and 1000.

Query precision and query recall. We consider the set of q most popular terms in a query log; query precision and recall are analogous to corpus precision and recall. Our intuition is that spammers might use terms that make up popular queries. As with corpus precision and recall, we extract eight features, for q = 100, 200, 500 and 1000.

Independent trigram likelihood. A trigram is three consecutive words. Let {p_w} be the probability distribution of trigrams in a page, let T be the set of all distinct trigrams in the page, and let k = |T| be the number of distinct trigrams. The independent trigram likelihood is a measure of the independence of the distribution of trigrams. It is defined in Ntoulas et al. [22] to be −(1/k) Σ_{w∈T} log p_w.

Entropy of trigrams. The entropy is another measure of the compressibility of a page, in this case more macroscopic than the compression rate feature because it is computed on the distribution of trigrams. The entropy of the distribution of trigrams {p_w} is defined as H = −Σ_{w∈T} p_w log p_w.

The above list gives a total of 24 features for each page. In general we found that, for this dataset, the content-based features do not provide as good a separation between spam and non-spam pages as for the data set used in Ntoulas et al. [22]. For example, we found that in the dataset we are using, the distributions of average word length in spam and non-spam pages were almost identical.
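The two trigram measures can be computed directly from their definitions. A minimal Python illustration (the function name and the toy inputs are placeholders, not the paper's actual preprocessing pipeline):

```python
import math
from collections import Counter

def trigram_measures(words):
    """Return (independent trigram likelihood, entropy) for a word list.

    p_w is the empirical probability of each distinct trigram; the
    likelihood is -(1/k) * sum(log p_w) over the k distinct trigrams,
    and the entropy is -sum(p_w * log p_w).
    """
    trigrams = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
    counts = Counter(trigrams)
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    k = len(probs)
    likelihood = -sum(math.log(p) for p in probs) / k
    entropy = -sum(p * math.log(p) for p in probs)
    return likelihood, entropy

# Highly repetitive (spam-like) text has few distinct trigrams,
# hence low entropy:
likelihood, entropy = trigram_measures("buy cheap pills".split() * 50)
```

Low entropy indicates a highly compressible, repetitive page, which is the signal these features are designed to capture.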
In contrast, for the data set of Ntoulas et al. [22] that particular feature provides very good separation. The same is true for many of the other content features. Some of the best features (judging only from the histograms) are the corpus precision and query precision, shown in Figure 3.

4.3 From page features to host features

In total, we extract 140 features for each host (described in Section 4.1) and 24 features for each page (described in Section 4.2). We then aggregate the content-based features of pages in order to obtain content-based features for hosts.

Let h be a host containing m web pages, denoted by the set P = {p_1, ..., p_m}. Let p̂ denote the home page of host h and p* denote the page with the largest PageRank among all pages in P. Let c(p) be the 24-dimensional content feature vector of page p. For each host h we form the content-based feature vector c(h) of h as follows:

c(h) = ⟨ c(p̂), c(p*), E[c(p)], Var[c(p)] ⟩.

Here E[c(p)] is the average of all vectors c(p), p ∈ P, and Var[c(p)] is the variance of c(p), p ∈ P. Therefore, for each host we have 4 × 24 = 96 content features. In total, we have 140 + 96 = 236 link and content features.

In the process of aggregating page features, we ignore hosts h for which the home page p̂ or the maximum-PageRank page p* is not present in our summary sample. This leaves us with a total of 8,944 hosts, out of which 5,622 are labeled; of these, 12% are spam hosts.

5. CLASSIFIERS

We used as the base classifier the implementation of C4.5 (decision trees) given in Weka [26]. Using both link and content features, the resulting tree used 45 unique features, of which 18 are content features. In the data we use, the non-spam examples outnumber the spam examples to such an extent that the classifier accuracy improves by misclassifying a disproportionate number of spam examples.
At the same time, intuitively, the penalty for misclassifying spam as normal is not equal to the penalty for misclassifying normal examples as spam. To minimize the misclassification error we used a cost-sensitive decision tree. We imposed a cost of zero for correctly classifying the instances, and set the cost of misclassifying a spam host as normal to be R times more costly than misclassifying a normal host as spam. Table 1 shows the results for different values of R. The value of R becomes a parameter that can be tuned to balance the true positive rate and the false positive rate. In our case, we wish to maximize the F-measure. Note that R = 1 is equivalent to having no cost matrix, and is the baseline classifier.

Table 1: Cost-sensitive decision tree
Cost ratio (R)         1       10      20      30      50
True positive rate     64.0%   68.0%   75.6%   80.1%   87.0%
False positive rate    5.6%    6.8%    8.5%    10.7%   15.4%
F-measure              0.632   0.633   0.646   0.642   0.594

Figure 3: Histogram of the query precision in non-spam vs. spam pages for k = 500.

We then try to improve the results of the baseline classifier using bagging. Bagging is a technique that creates an ensemble of classifiers by sampling with replacement from the training set to create N classifiers whose training sets contain the same number of examples as the original training set, but may contain duplicates. The labels of the test set are determined by a majority vote of the classifier ensemble. In general, any classifier can be used as a base classifier, and in our case we used the cost-sensitive decision trees described above. Bagging improved our results by reducing the false positive rate, as shown in Table 2. The decision tree created by bagging was roughly the same size as the tree created without bagging, and used 49 unique features, of which 21 were content features.
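The bootstrap-and-vote procedure described above can be sketched in a few lines of Python. This is only an illustration: the `ThresholdStump` base learner and the toy one-dimensional data are assumptions made for brevity (the paper uses Weka's cost-sensitive C4.5 trees over 236 features), but the bagging logic itself matches the description in the text:

```python
import random
from collections import Counter

class ThresholdStump:
    """Toy one-feature classifier standing in for the cost-sensitive
    decision tree used in the paper (illustrative assumption only)."""
    def fit(self, xs, ys):
        spam = [x for x, y in zip(xs, ys) if y == "spam"]
        normal = [x for x, y in zip(xs, ys) if y == "normal"]
        # split halfway between the two class means
        self.t = (sum(spam) / len(spam) + sum(normal) / len(normal)) / 2
    def predict(self, x):
        return "spam" if x >= self.t else "normal"

class Bagging:
    """Train n base classifiers, each on a bootstrap sample (drawn with
    replacement, same size as the original training set), then label
    test instances by majority vote of the ensemble."""
    def __init__(self, make_classifier, n_classifiers=10, seed=0):
        self.make = make_classifier
        self.n = n_classifiers
        self.rng = random.Random(seed)
        self.ensemble = []
    def fit(self, xs, ys):
        data = list(zip(xs, ys))
        for _ in range(self.n):
            sample = self.rng.choices(data, k=len(data))  # bootstrap
            clf = self.make()
            clf.fit([x for x, _ in sample], [y for _, y in sample])
            self.ensemble.append(clf)
    def predict(self, x):
        votes = Counter(clf.predict(x) for clf in self.ensemble)
        return votes.most_common(1)[0][0]

# Toy data: low feature values are normal, high values are spam.
xs = [i / 20 for i in range(10)] + [0.8 + i / 100 for i in range(10)]
ys = ["normal"] * 10 + ["spam"] * 10
bag = Bagging(ThresholdStump, n_classifiers=11)
bag.fit(xs, ys)
```

An odd ensemble size avoids ties in the two-class majority vote.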
Table 2: Bagging with a cost-sensitive decision tree Cost ratio (R) 1 10 20 30 50 True p ositive rate 65.8% 66.7% 71.1% 78.7% 84.1% False p ositive rate 2.8% 3.4% 4.5% 5.7% 8.6% F-Measure 0.712 0.703 0.704 0.723 0.692 The results of classification rep orted in Tables 1 and 2 use b oth link and content features. Table 3 shows the contribution of each typ e of feature to the classification. The content features serve to reduce the false-p ositive rate, without diminishing the true p ositive result, and thus improve the overall p erformance of the classifier. The classifier that serves as the foundation for future exp eriments pap er uses bagging with a cost-sensitive decision tree, where R = 30. Table 3: Comparing link and content features Both Link-only Content-only True p ositive rate 78.7% 79.4% 64.9% False p ositive rate 5.7% 9.0% 3.7% F-Measure 0.723 0.659 0.683 (a) Spam nodes in out-links 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 0.0 0.2 0.4 0.6 0.8 1.0 Out-links of non spam Outlinks of spam (b) Spam nodes in in-links 0.4 0.35 0.3 0.25 0.2 0.15 0.1 0.05 0 0.0 0.2 0.4 0.6 0.8 1.0 In-links of non spam In-links of spam Figure 4: Histogram of the fraction of spam hosts in the links of non-spam or spam hosts. 6.2 Clustering We use the result of a graph clustering algorithm to improve the prediction obtained from the classification algorithm. Intuitively, if the ma jority of a cluster is predicted to b e spam then we change the prediction for al l hosts in the cluster to spam. Similarly if the ma jority of a cluster is predicted to b e non-spam then we predict that al l hosts in this cluster are non-spam. We consider the undirected graph G = (V , E , w) where V is the set of hosts, w is a weighting function from V × V to integers so that the weight w(u, v ) is equal to the the numb er of links b etween any page in host u and any page in host v , and E is the set of edges with non-zero weight. 
Ignoring the direction of the links may result in a loss of information for detecting spam, but it drastically simplifies the graph clustering algorithm. We cluster the graph G using the METIS graph clustering algorithm algorithm [18].1 We partition the 11400 hosts of the graph into 1000 clusters, so as to split the graph into many small clusters. We found that the numb er of clusters is not crucial, and we obtained similar results for partitioning the graph in 500 and 2000 clusters. The clustering algorithm can b e describ ed as follows. Let the clustering of G consist of m clusters C1 , . . . , Cm , which form a disjoint partition of V . Let p(h) [0..1] b e the prediction of a particular classification algorithm C so that for each host h a value of p(h) equal to 0 indicates nonspam, while a value of 1 indicates spam. (Informally, we call p(h) the predicted spamicity of host h) For each cluster Cj , j = 1, . . . , m, we compute its average spamicity p(Cj ) = P 1 hCj p(h). Our algorithm uses two thresholds, a lower |Cj | threshold tl and an upp er threshold tu . For each cluster Cj if p(Cj ) tl then all hosts in Cj are marked as non-spam, and p(h) is set to 0 for all h Cj . Similarly, if p(Cj ) tu then all hosts in Cj are marked as spam, and p(h) is set to 1. The results of the clustering algorithm are shown in Table 4. The improvement of the F-measure obtained over classifier without bagging is statistically significant at the 0.05 confidence level; the improvement for the classifier with bagging is much smaller. Note that this algorithm never has access to the true lab els of the data, but uses only predicted lab els, as explained at the end of section 3.2. We considered variations of the ab ove algorithm that change the lab els of only clusters larger than a given size threshold, or of clusters of a small relative cut (the ratio of the weight of edges going out of the cluster with resp ect to weight of edges inside the cluster). 
However, none of these variations 1 6. SMOOTHING In this section we describ e a different usage of the link structure of the graph than the one presented in Section 4. During the extraction of link-based features, all nodes in the network were anonymous, while in this regularization phase, the identity (the predicted lab el) of each node is known, and imp ortant to the algorithm. 6.1 Topological dependencies of spam nodes Before describing how to use the aggregation of spam hosts to improve the accuracy of spam detection, we provide exp erimental evidence for the following two hyp otheses: Non-spam nodes tend to b e linked by very few spam nodes, and usually link to no spam nodes. Spam nodes are mainly linked by spam nodes. Examining the out-link and the in-link graphs separately, we count the numb er of spam hosts contained in the adjacency list of each one of the hosts. Let SOUT (x) b e the fraction of spam hosts linked by host x out of all lab eled hosts linked by host x. Figure 4(a) shows the histogram of SOUT for spam and non-spam hosts. We see that almost all non-spam hosts link mostly to non-spam hosts. The same is not true for spam hosts, which tend to link b oth spam and non-spam hosts. Similarly, let SIN (x) b e the fraction of spam hosts that link to host x out of all lab eled hosts that link to x. Figure 4(b) shows the histograms of SIN for spam and non-spam hosts. In this case there is a clear separation b etween spam and non-spam hosts. The general trend is that spam hosts are linked mostly by other spam hosts. More than 85% of the hosts have an SIN value of more than 0.7. On the other hand, the opp osite is true for non-spam hosts; more than 75% of the non-spam hosts have an SIN value of less than 0.2. 
1 http://glaros.dtc.umn.edu/gkhome/views/metis/

Table 4: Results of the clustering algorithm
                        Baseline   Clustering
Without bagging
  True positive rate    75.6%      74.5%
  False positive rate    8.5%       6.8%
  F-Measure              0.646      0.673
With bagging
  True positive rate    78.7%      76.9%
  False positive rate    5.7%       5.0%
  F-Measure              0.723      0.728

Then it generates a set of extra features for each object, by combining the predictions for the related objects in the graph. Finally, it adds this extra feature to the input of C, and runs the algorithm again to get new, hopefully better, predictions for the data. Let p(h) ∈ [0..1] be the prediction of a particular classification algorithm C, as described above. Let r(h) be the set of pages related to h in some way. We compute:

f(h) = (1/|r(h)|) Σg∈r(h) p(g)

Next, we add f(h) as an extra feature for instance h in the classification algorithm C, and run the algorithm again. This process can be repeated many times, but most of the improvement is obtained with the first iteration. Table 6 reports the results of applying stacked graphical learning, by including one extra feature with the average predicted spamicity of r(h). For the set r(h) of pages related to h we use either the in-links, the out-links, or both.

Table 6: Results with stacked graphical learning
                        Baseline   Avg. of in   Avg. of out   Avg. of both
True positive rate      78.7%      84.4%        78.3%         85.2%
False positive rate      5.7%       6.7%         4.8%          6.1%
F-Measure                0.723      0.733        0.742         0.750

We observe that there is an improvement over the baseline, and the improvement is more noticeable when using the entire neighborhood of the host as input. The improvement is statistically significant at the 0.05 confidence level. In comparison with the other techniques we studied, this method is able to significantly improve even the classifier with bagging.
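One pass of the stacked-learning loop can be sketched as follows, under the simplifying assumption that the base classifier is exposed as a function mapping feature vectors to spamicity scores (the classifier, graph, and feature encoding here are stand-ins, not the paper's actual implementation):

```python
def neighbor_feature(scores, related):
    """f(h): average predicted spamicity over the hosts related to h."""
    feats = {}
    for h, nbrs in related.items():
        known = [g for g in nbrs if g in scores]
        feats[h] = sum(scores[g] for g in known) / len(known) if known else 0.0
    return feats

def stacked_pass(train_and_predict, features, related):
    """One pass of stacked graphical learning: (1) get base predictions,
    (2) derive f(h) from the predictions of the related hosts,
    (3) retrain with f(h) appended as an extra feature."""
    scores = train_and_predict(features)             # initial predictions p(h)
    f = neighbor_feature(scores, related)            # extra feature f(h)
    augmented = {h: vec + [f.get(h, 0.0)] for h, vec in features.items()}
    return train_and_predict(augmented)              # new predictions
```

A second pass simply feeds the new predictions back through the same two steps.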
A second pass of stacked graphical learning yields an even better performance: the false positive rate increases slightly, but the true positive rate increases by almost 3%, compensating for it and yielding a higher F-Measure. The feature with the highest information gain is the added feature, which thus serves as a type of summary of the other features. With the added feature, the resulting decision tree is smaller and uses fewer link features; the tree uses 40 features, of which 20 are content features. Consistent with [8], doing more iterations does not improve the accuracy of the classifier significantly.

Table 7: Second pass of stacked graphical learning
                        Baseline   First pass   Second pass
True positive rate      78.7%      85.2%        88.4%
False positive rate      5.7%       6.1%         6.3%
F-Measure                0.723      0.750        0.763

The improvement of the F-Measure from 0.723 to 0.763 is of about 5.5%, and this actually translates to a large improvement in the accuracy of the classifier. The smoothing techniques we use improve the detection rate from 78.7% to 88.4%, while the error rate grows by less than one percentage point, from 5.7% to 6.3%.

However, none of these variations yielded any noticeable improvement over the performance of the basic algorithm, so we do not discuss them in more detail. We note that the implementation of the clustering algorithm we use might not be scalable to arbitrarily large Web collections. For such data one might want to use more efficient clustering algorithms, for instance, pruning edges below a threshold and finding connected components.

6.3 Propagation

We use the graph topology to smooth the predictions by propagating them using random walks, following Zhou et al. [32]. We simulate a random walk starting from the nodes that the base classifier has marked as spam, following a link with probability α, and returning to a spam node with probability 1 − α. When returning to a spam node, the node to return to is picked with probability proportional to its "spamicity" prediction.
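This random walk is essentially a personalized PageRank whose teleportation distribution is proportional to the predicted spamicity of each host. A minimal sketch under that reading (the graph representation and parameter values are illustrative, and we assume every linked host appears as a key of the spamicity dictionary):

```python
def propagate(out_links, spamicity, alpha=0.3, iterations=10):
    """Smooth spamicity predictions with a random walk: follow an
    out-link with probability alpha; otherwise jump back to a host
    chosen with probability proportional to its predicted spamicity."""
    total = sum(spamicity.values()) or 1.0
    seed = {h: s / total for h, s in spamicity.items()}  # teleport distribution
    score = dict(seed)
    for _ in range(iterations):
        nxt = {h: (1.0 - alpha) * seed[h] for h in seed}
        for h, nbrs in out_links.items():
            if nbrs:
                share = alpha * score[h] / len(nbrs)  # spread mass over out-links
                for g in nbrs:
                    nxt[g] += share
        score = nxt
    return score
```

The final scores can then be thresholded (with a threshold learned on the training part of the data) to classify hosts as spam or non-spam.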
After this process was run, we used the training part of the data to learn a threshold parameter, and used this threshold to classify the testing part as spam or non-spam. We tried three forms of this random walk: on the host graph, on the transposed host graph (meaning that the activation goes backwards), and on the undirected host graph. We tried different values of the parameter α and obtained improvements over the baseline with α ∈ [0.1, 0.3], implying short chains in the random walk. In Table 5 we report the results for α = 0.3, after 10 iterations (this was enough for convergence on this host graph).

Table 5: Results of applying propagation
                        Baseline   Fwds.   Backwds.   Both
Classifier without bagging
  True positive rate    75.6%      70.9%   69.4%      71.4%
  False positive rate    8.5%       6.1%    5.8%       5.8%
  F-Measure              0.646      0.665   0.664      0.676
Classifier with bagging
  True positive rate    78.7%      76.5%   75.0%      75.2%
  False positive rate    5.7%       5.4%    4.3%       4.7%
  F-Measure              0.723      0.716   0.733      0.724

As shown in Table 5, the classifier without bagging can be improved (and the improvement is statistically significant at the 0.05 confidence level), but the increase in accuracy for the classifier with bagging is small.

6.4 Stacked graphical learning

Stacked graphical learning is a meta-learning scheme described recently by Cohen and Kou [8]. It uses a base learning scheme C to derive initial predictions for all the objects in the dataset.

7. CONCLUSIONS

There is a clear tendency of spammers to be linked together, and this tendency can be exploited by search engines to detect spam more accurately. There is a lot of related work on spam detection; however, we can compare our results with previous results only indirectly. The reason is that the majority of prior research on Web spam has been done on data sets that are not public. With respect to using content only, we note that our set of content-based features includes the set described in Ntoulas et al. [22].
In their paper, they report an F-Measure of 0.877 [22, Table 2] using a classifier with bagging. Using essentially the same technique, we obtain a performance that is much lower than theirs (our F-Measure is 0.683 in Table 3, compared to their 0.877). This is consistent with the observation presented at the end of Section 4.2, that is, that the content of the spam pages in the dataset we use resembles much more closely the content of non-spam pages. Similarly, our link-based features were extracted from a public dataset used previously by Becchetti et al. [4], and again, the accuracy of the same classifier is much lower on the new dataset (our F-Measure is 0.683, compared to the previous F-Measure of 0.879), supporting the conclusion that distinguishing between spam and non-spam is inherently more difficult in the new data. Nevertheless, our best classifier detects 88.4% of the spam hosts with 6.3% false positives. If the error rate is too high for the application or search engine policies at hand, it can be lowered by adjusting the cost matrix, for more conservative spam filtering (at the expense of a lower detection rate). The graph-based algorithms we use operate over the graph in a streaming fashion, using a small number of passes over the Web graph, which is kept in secondary memory, and keeping only a small amount of data in main memory. This means that the system we have proposed can be used on Web datasets of any size. The connected nature of spam on the Web graph suggests that the use of regularization is a promising area for future work. While we explored the regularization methods in isolation and assessed their performance independently, an interesting direction for future work is to combine them in order to improve the overall accuracy.
Acknowledgements
We thank the support of the EU within the 6th Framework Programme under contract 001907 "Dynamically Evolving, Large Scale Information Systems" (DELIS). This work was done while Fabrizio Silvestri was visiting the Yahoo! Research laboratories in Barcelona.

8. REFERENCES
[1] R. Angelova and G. Weikum. Graph-based text classification: learn from your neighbors. In ACM SIGIR, pages 485-492, 2006.
[2] A. Arasu, J. Cho, H. Garcia-Molina, A. Paepcke, and S. Raghavan. Searching the web. ACM Transactions on Internet Technology (TOIT), 1(1):2-43, 2001.
[3] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison Wesley, May 1999.
[4] L. Becchetti, C. Castillo, D. Donato, S. Leonardi, and R. Baeza-Yates. Link-based characterization and detection of Web spam. In AIRWeb, 2006.
[5] L. Becchetti, C. Castillo, D. Donato, S. Leonardi, and R. Baeza-Yates. Using rank propagation and probabilistic counting for link-based spam detection. In ACM WebKDD, Pennsylvania, USA, August 2006.
[6] A. Benczúr, K. Csalogány, and T. Sarlós. Link-based similarity search to fight web spam. In AIRWeb, 2006.
[7] C. Castillo, D. Donato, L. Becchetti, P. Boldi, S. Leonardi, M. Santini, and S. Vigna. A reference collection for web spam. ACM SIGIR Forum, 40(2):11-24, December 2006.
[8] W. W. Cohen and Z. Kou. Stacked graphical learning: approximating learning in Markov random fields using very short inhomogeneous Markov chains. Technical report, 2006.
[9] A. L. da Costa-Carvalho, P.-A. Chirita, E. S. de Moura, P. Calado, and W. Nejdl. Site level noise removal for search engines. In WWW, pages 73-82, New York, NY, USA, 2006.
[10] B. D. Davison. Topical locality in the web. In ACM SIGIR, pages 272-279, Athens, Greece, 2000.
[11] I. Drost and T. Scheffer. Thwarting the nigritude ultramarine: learning to identify link spam. In ECML, volume 3720 of LNAI, pages 233-243, Porto, Portugal, 2005.
[12] N. Eiron, K. S. McCurley, and J. A. Tomlin. Ranking the web frontier. In WWW, pages 309-318, New York, NY, USA, 2004.
[13] Z. Gyöngyi and H. Garcia-Molina. Link spam alliances. In VLDB, 2005.
[14] Z. Gyöngyi and H. Garcia-Molina. Web spam taxonomy. In AIRWeb, 2005.
[15] Z. Gyöngyi, H. Garcia-Molina, and J. Pedersen. Combating Web spam with TrustRank. In VLDB, 2004.
[16] Z. Gyöngyi and H. Garcia-Molina. Spam: It's not just for inboxes anymore. IEEE Computer Magazine, 38(10):28-34, 2005.
[17] M. R. Henzinger, R. Motwani, and C. Silverstein. Challenges in web search engines. ACM SIGIR Forum, 36(2):11-22, 2002.
[18] G. Karypis and V. Kumar. Multilevel k-way partitioning scheme for irregular graphs. Journal of Parallel and Distributed Computing, 48(1):96-129, 1998.
[19] Q. Lu and L. Getoor. Link-based classification using labeled and unlabeled data. In ICML Workshop on The Continuum from Labeled to Unlabeled Data, Washington, DC, 2003.
[20] S. A. Macskassy and F. Provost. Suspicion scoring based on guilt-by-association, collective inference, and focused data access. In International Conference on Intelligence Analysis, 2005.
[21] G. Mishne, D. Carmel, and R. Lempel. Blocking blog spam with language model disagreement. In AIRWeb, 2005.
[22] A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly. Detecting spam web pages through content analysis. In WWW, pages 83-92, Edinburgh, Scotland, May 2006.
[23] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: bringing order to the Web. Technical report, Stanford Digital Library Technologies Project, 1998.
[24] X. Qi and B. D. Davison. Knowing a web page by the company it keeps. In CIKM, pages 228-237, Arlington, VA, USA, November 2006.
[25] G. Shen, B. Gao, T.-Y. Liu, G. Feng, S. Song, and H. Li. Detecting link spam using temporal information. In ICDM, Hong Kong, December 2006.
[26] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, 1999.
[27] B. Wu and B. D. Davison. Cloaking and redirection: a preliminary study. In AIRWeb, 2005.
[28] B. Wu and B. D. Davison. Identifying link farm spam pages. In WWW, pages 820-829, New York, NY, USA, 2005.
[29] B. Wu, V. Goel, and B. D. Davison. Propagating trust and distrust to demote web spam. In MTW, May 2006.
[30] H. Zhang, A. Goel, R. Govindan, K. Mason, and B. Van Roy. Making eigenvector-based reputation systems robust to collusion. In WAW, volume 3243 of LNCS, pages 92-104, Rome, Italy, 2004. Springer.
[31] T. Zhang, A. Popescul, and B. Dom. Linear prediction models with graph regularization for web-page categorization. In ACM KDD, pages 821-826, New York, NY, USA, 2006.
[32] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf. Learning with local and global consistency. Advances in Neural Information Processing Systems, 16:321-328, 2004.