Analysis of a Low-dimensional Linear Model under Recommendation Attacks

Sheng Zhang, Yi Ouyang, James Ford, Fillia Makedon
Department of Computer Science, Dartmouth College
{clap, ouyang, jford, makedon}@cs.dartmouth.edu

and preferences liked in the past. Generally, CF techniques can be divided into two categories: memory-based algorithms and model-based algorithms. Memory-based algorithms, which are either user-based or item-based, first identify the top k users or items most similar to the active user or item and then combine those ratings together to compute predictions. In contrast to memory-based algorithms, model-based algorithms learn a predictive model from rating profiles and use that model to generate predictions. While CF techniques are beneficial to users, they might be vulnerable to recommendation attacks because they are dependent on external sources of information. In these attacks (which have been termed shilling attacks [10]), attackers influence a recommendation system in a manner advantageous to them by introducing biased rating profiles. Because recommendation systems are widely used in the realm of e-commerce, there is a natural motivation for producers of items (manufacturers, authors, etc.) to use these shilling attacks so that their items are recommended to users more often. Therefore, recommendation system operators should be concerned about these attacks when they design and implement systems. Recent research [4, 10, 12, 13] has identified and examined the vulnerabilities of memory-based CF techniques in the face of recommendation attacks. O'Mahony et al. [13] first performed empirical studies of a user-based CF algorithm under shilling attacks and showed that attacks are successful at affecting predictions. Lam and Riedl [10] further evaluated the impact of attacks by including an item-based CF algorithm, and their study suggested that the item-based approach is much less affected by these attacks. More recently, Burke et al.
[4] and Mobasher et al. [12] proposed several new attack models that need little knowledge of the specific system being targeted to have a strong likelihood of success in attacks on both user-based and item-based algorithms. To the best of our knowledge, all previous work considered only memory-based CF algorithms when exploring vulnerabilities of recommendation systems, and no empirical study has been conducted on model-based algorithms. Since model-based algorithms have different mechanisms, it is important to know how well the attacks that are effective against memory-based algorithms can affect model-based algorithms. In this paper, we focus on a popular model-based algorithm, a Singular Value Decomposition (SVD) based CF algorithm used in [1, 2, 5, 6, 16, 19, 20], which assumes that the rating matrix can be well described by a low-dimensional linear model and then computes the best model that maximizes the log-likelihood of ratings.

ABSTRACT

Collaborative filtering techniques have become popular in the past decade as an effective way to help people deal with information overload. Recent research has identified significant vulnerabilities in collaborative filtering techniques. Shilling attacks, in which attackers introduce biased ratings to influence recommendation systems, have been shown to be effective against memory-based collaborative filtering algorithms. We examine the effectiveness of two popular shilling attacks (the random attack and the average attack) on a model-based algorithm that uses Singular Value Decomposition (SVD) to learn a low-dimensional linear model. Our results show that the SVD-based algorithm is much more resistant to shilling attacks than memory-based algorithms. Furthermore, we develop an attack detection method directly built on the SVD-based algorithm and show that this method detects random shilling attacks with high detection rates and very low false alarm rates.
Categories and Subject Descriptors

H.3.5 [Information Storage and Retrieval]: Online Information Services--Commercial services; K.4.4 [Computers and Society]: Electronic Commerce--Security

General Terms

Experimentation, Security

Keywords

Collaborative filtering, recommender systems, shilling attacks, anomaly detection

1. INTRODUCTION

Recommendation systems help people deal with information overload by recommending products or services from a large number of candidates. Collaborative Filtering (CF) is one popularly used recommendation technique. It recommends to a user the items that people with similar tastes

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGIR'06, August 6-11, 2006, Seattle, Washington, USA. Copyright 2006 ACM 1-59593-369-7/06/0008 ...$5.00.

We perform a series of experiments to quantitatively evaluate the effect of two popular recommendation attacks (the random attack and the average attack) on the SVD-based algorithm, and compare that with the effects of attacks on memory-based algorithms. Our results demonstrate that the SVD-based algorithm is much less affected by attacks. We further propose a method built on the SVD-based algorithm to detect recommendation attacks. Our results demonstrate that this method is effective at detecting random attacks (though not average attacks) with high detection rates and low false alarm rates.

to a rating matrix is that observed ratings A_{i,j} are combinations of ratings from a low-dimensional linear model (denoted as X) and Gaussian noise (with zero mean). That is, A_{i,j} = X_{i,j} + Z_{i,j}, with Z_{i,j} i.i.d. N(0, σ²).
Since the rating matrix in the real world is incomplete and sparse, Srebro and Jaakkola [19] proposed an Expectation-Maximization (EM) algorithm to maximize the log-likelihood of all observed ratings A^o, that is, log Pr(A^o | X). In this paper, we use this SVD-based algorithm in which the EM algorithm is incorporated. The details of the derivation of this EM algorithm can be found in [19, 20]; an overview is given below.

In the Expectation step of the t-th iteration, a filled-in matrix A^(t) is formed where unobserved entries A^(t)_{i,j} are set to the corresponding values of the linear model computed in the previous iteration (X^(t-1)_{i,j}) and observed entries are unchanged from A. That is,

    A^(t)_{i,j} = A_{i,j}          if A_{i,j} is rated,
    A^(t)_{i,j} = X^(t-1)_{i,j}    otherwise.

In the following Maximization step, we perform SVD on this filled-in matrix A^(t) to get A^(t) = U S V^T. The updated linear model X^(t) is then computed as X^(t) = U_k S_k V_k^T.

2. IMPACT OF ATTACKS

In this section, we examine how well the common shilling attacks that are effective against memory-based algorithms operate against the SVD-based algorithm. We compare the impact of recommendation attacks on the SVD-based algorithm and two memory-based (user-based [8] and item-based [17]) algorithms.

2.1 CF algorithms

We now give more details on the three tested CF algorithms.

2.1.1 User-based Algorithm

The user-based CF algorithm [8, 15] first computes the correlations between users using a mean-adjusted Pearson correlation, and then combines a weighted average of the k nearest neighbors' ratings to produce a prediction. More precisely, a predicted rating P_{i,j} for user i on item j is computed by

    P_{i,j} = Ā_i + ( Σ_{u=1}^{k} w_{i,u} · (A_{u,j} - Ā_u) ) / ( Σ_{u=1}^{k} |w_{i,u}| ),

where A_{u,j} is user u's rating on item j, Ā_i is user i's average rating, and w_{i,u} is the correlation between users i and u. We use an optimized version suggested in [8].
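As a concrete illustration, the EM procedure described above can be sketched in NumPy. This is a minimal sketch, not the authors' implementation: the variable names, the use of NaN to mark unrated entries, and the omission of the paper's per-user rating normalization are our simplifications.

```python
import numpy as np

def fit_low_rank_em(A, k=10, n_iters=10):
    """Fit a rank-k linear model X to a partially observed rating matrix.

    A: (users x items) array with np.nan marking unrated entries.
    In the first iteration, unrated entries are effectively filled with zero.
    """
    observed = ~np.isnan(A)
    X = np.zeros_like(A)
    for _ in range(n_iters):
        # E-step: fill unobserved entries with the current model's values
        A_filled = np.where(observed, A, X)
        # M-step: rank-k truncated SVD of the filled-in matrix
        U, s, Vt = np.linalg.svd(A_filled, full_matrices=False)
        X = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    return X  # predictions: P[i, j] = X[i, j]
```

On a matrix whose observed entries come from an exactly low-rank model, repeated E/M steps fill the missing entries in toward the values the model implies.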
We set k to 20 and use n/50 significance weighting.¹ A similarity threshold of 0.1 is used, so only positive similarities are considered. This parameterization is exactly the same as that in [10].

Here, U_k, S_k, and V_k are matrices composed of the top k left singular vectors, singular values, and right singular vectors, respectively. The above EM procedure is ensured to converge, which means that the log-likelihood of all observed ratings given the current model estimate is always nondecreasing. After the EM procedure finishes, a prediction is computed as the corresponding entry in the final computed model, i.e., P_{i,j} = X_{i,j}. In this paper, the rank of the linear model X is set to 10 and the number of iterations in the EM algorithm is set to 10. Taking into account the differences in rating scale between different users, we initially normalize each entry for user i in a rating matrix by first subtracting user i's rating average from it and then dividing the difference by the standard deviation of user i's ratings. In the first iteration of the EM procedure, unrated entries are replaced with zero. Finally, predictions are denormalized to the original rating range.

2.1.2 Item-based Algorithm

The item-based algorithm [17] computes and uses similarities between items rather than users. The formula used to compute a prediction is:

    P_{i,j} = ( Σ_{v=1}^{k} s_{v,j} · A_{i,v} ) / ( Σ_{v=1}^{k} |s_{v,j}| ).

Here, s_{v,j} is the similarity between items v and j. Our implementation uses the adjusted cosine method to compute similarities. The number of neighbors is set to 20 and only positive similarities are considered. The parameterization of this version is also the same as that in [10].

2.1.3 SVD-based Algorithm

SVD was first introduced into recommendation systems in [2] and [16]. The underlying assumption of applying SVD

2.2 Attack Experimental design

¹ If two users have fewer than 50 commonly rated items, a significance weight of n/50 is used, where n is the number of co-rated items.
Otherwise, the significance weight is 1.

Our experimental design builds on the previous work in [10]. The recommendation attacks considered are shill attacks in which the attackers introduce a new set of shill users, and a set of ratings made by those shill users, to the recommendation system. The two attack intents considered are to "push" a set of items so that they are recommended to more users, and to "nuke" a set of items so that they are recommended to fewer users. A MovieLens data set consisting of about 1 million ratings on 3706 items by 6040 users is used in all experiments. Each user gives ratings to at least 20 items. Ratings are discrete-valued between 1 and 5. The number of new shill users introduced to the data set is varied between 20 and 200. To maximize the number of items in common between shill users and real users, each shill user gives ratings to all movies. We manually selected 20 movies for our target set (see Table 1); this selection of items represents a wide range of popularity (number of ratings) and likability (mean rating).

Table 1: Properties (number of ratings and average rating) of items in the selected target set.

    # ratings   mean
            7   2.57
           14   3.00
           23   3.30
           33   3.97
           44   1.32
           58   3.00
           74   4.20
           97   3.64
          123   2.79
          153   2.24
          187   4.32
          225   3.44
          278   3.72
          342   1.61
          426   4.02
          550   2.82
          722   3.47
         1045   2.93
         1331   3.50
         2653   4.34

Two common attack models, the random attack and the average attack, are used in our experiments. In the random attack model, each shill user gives random values from a normal distribution to items not in the target set and rates items in the target set with a value equal to either the maximum or minimum allowed rating depending on its intention (push or nuke, respectively). A normal distribution with mean 3.6 and standard deviation 1.1 is used because these two values represent the rating distribution in the original data set. Note that randomly generated values are always rounded to allowed ratings.
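The random-attack profile generation described above can be sketched as follows. The function name and the choice to clip after rounding are ours; the paper only specifies that generated values are rounded to allowed ratings.

```python
import numpy as np

def random_attack_profile(n_items, target_idx, push=True,
                          mu=3.6, sigma=1.1, r_min=1, r_max=5, rng=None):
    """Generate one shill user's ratings for the random attack model.

    Non-target items get rounded draws from N(mu, sigma^2), clipped to the
    allowed rating range; target items get the max (push) or min (nuke) rating.
    """
    if rng is None:
        rng = np.random.default_rng()
    ratings = np.rint(rng.normal(mu, sigma, size=n_items))
    ratings = np.clip(ratings, r_min, r_max)
    ratings[target_idx] = r_max if push else r_min
    return ratings
```

The average attack differs only in that the per-item mean of the normal distribution is the item's average rating rather than the global 3.6.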
The average attack model used here is similar, but the ratings for items not in the target set are drawn from a normal distribution with mean equal to the average rating of the item being rated and standard deviation 1.1. The intuition behind the average attack model is to make the inserted bots more similar to existing users and thus achieve a stronger attacking effect.

It is easy to see that the SVD-based algorithm also has much smaller absolute changes in ExpTop40 if we multiply the percentage changes by the original ExpTop40 values (without attack). Overall, the above results demonstrate that the SVD-based CF algorithm is much more resistant to the random and average shilling attacks than the memory-based CF algorithms.

2.5 Discussion

2.3 Metrics

To measure how effective an attack is in accomplishing its goal, we use two metrics proposed in [10] and [13]: Prediction Shift and Expected Top-N Occupancy (ExpTopN). Prediction shift is defined as the average change in prediction toward some target value (the maximum or the minimum allowed rating, respectively, for a push or nuke attack) over all users and target items. ExpTopN is defined as the expected number of occurrences of all target items in a top-N recommendation list, measured over all users, assuming that the displayed ordering of items tied at any particular rank is random. As indicated in [10], the median recommendation search by MovieLens users ends with at most the first 40 items displayed. Therefore, we use ExpTop40 to reflect actual MovieLens usage.

2.4 Attack Experimental results

A total of 48 experiments were performed in a 3 × 2 × 2 × 4 test matrix. The algorithm (user-based, item-based, or SVD-based), intent (push or nuke), attack model (random or average), and the number of shill users/bots (20, 50, 100, 200) were varied in each experiment. Twenty trials were tested for each experiment and mean values are reported. Table 2 lists the obtained results measured by PredShift and the percentage change in ExpTop40.
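The prediction shift metric defined above is simply a signed mean over users and target items; it can be sketched as follows (a sketch under our naming; the arrays of pre- and post-attack predictions are assumptions for illustration).

```python
import numpy as np

def prediction_shift(pred_before, pred_after, push=True):
    """Average change in prediction toward the attack's target value.

    pred_before / pred_after: (users x target items) prediction arrays
    computed before and after the attack profiles are injected.
    For a push attack the target value is the maximum rating, so a positive
    shift is movement upward; for a nuke attack the sign is flipped.
    """
    delta = np.asarray(pred_after) - np.asarray(pred_before)
    return float(np.mean(delta if push else -delta))
```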
The original ExpTop40 values (without attacks) for the three CF algorithms are 0.4012 (user-based), 0.2647 (item-based), and 0.5536 (SVD-based). Compared with the results in [10], the results obtained from the user-based and the item-based algorithms show similar patterns. The former responds very strongly to all attacks while the latter responds less strongly. The SVD-based algorithm has the smallest change in both PredShift and ExpTop40 for all attacks. The largest change in PredShift is no more than 0.003 and the largest percentage change in ExpTop40 is no more than 4%.

For the item-based algorithm, there are two noteworthy points. First, random nuke attacks can actually push target items in both metrics. This unexpected observation was also reported in [10]. Second, average push attacks may nuke target items slightly in ExpTop40, although the item ratings are shifted greatly toward the maximum rating in PredShift. The inconsistency of the results in these two metrics implies that average push attacks may unintentionally push items not in the target set. We suppose that the above two unexpected observations might be caused by the experimental design and the characteristics of the item-based algorithm. Note that in the item-based algorithm, the prediction results corresponding to an item will change only when the nearest neighbor items of this item become different. In the current design, all 20 target items are pushed or nuked in each added bot. It is likely that two or more target items go into (or out of) the nearest neighbors of a certain item, whether or not that item is in the target set. Since the likabilities (mean ratings) of the target items range widely, the predictions for this item may change unintentionally. Taking this assumption into account, we propose a slightly different experimental design.
In each experiment, only one target item is chosen and then pushed or nuked in the injected bots; the mean prediction and ExpTop40 occupancy corresponding to this target item are recorded. After conducting 20 experiments, one for each target item, we compute PredShift and the percentage change in ExpTop40. Listed in Table 3 are the results of the item-based algorithm under this new experimental design. The results obtained for the item-based algorithm are significantly different from those listed in Table 2, except for the group of average nuke attacks. Although it is hard to tell which experimental design is more objective, the inconsistency between their results suggests that the effect of attacks on the item-based algorithm is highly dependent on the number of target items in the attacking bots. We note that the results for the user-based algorithm and the SVD-based algorithm in this new experiment (not shown) agree with the results in Table 2 very well.

3. DETECTING ATTACKS

One important question regarding shilling attacks in recommendation systems is whether attacks can be detected. In this section, we propose an attack detection method that exploits the low-dimensional linear model computed by the SVD-based CF algorithm and verify its effectiveness on MovieLens.

Table 2: Effect of attacks measured by prediction shift (PredShift) and percent change in expected top-40 occupancy (ExpTop40). For PredShift, a larger value means the attack is more effective, and for ExpTop40, a larger absolute value means the attack is more effective. The original ExpTop40 values (without attacks) are 0.4012 (user-based), 0.2647 (item-based), and 0.5536 (SVD-based).
                            User-based           Item-based           SVD-based
Intent  Attack    Bots  PredShift  ExpTop40  PredShift  ExpTop40  PredShift  ExpTop40
Push    Random      20    0.198      195%      0.004       5%       0.002       4%
Push    Random      50    0.315      542%      0.006       6%       0.001       3%
Push    Random     100    0.418      920%      0.032      30%       0.000       2%
Push    Random     200    0.517     1305%      0.062      91%       0.000       1%
Push    Average     20    0.486      518%      0.307       1%       0.002       4%
Push    Average     50    0.664      971%      0.432      -9%       0.002       4%
Push    Average    100    0.778     1290%      0.508      -8%       0.001       3%
Push    Average    200    0.864     1544%      0.559      -8%       0.001       1%
Nuke    Random      20    0.248      -18%     -0.028       3%       0.001       3%
Nuke    Random      50    0.396       -8%     -0.041      17%       0.002       2%
Nuke    Random     100    0.531       -6%     -0.070      61%       0.002       1%
Nuke    Random     200    0.669       -7%     -0.104     161%       0.003       0%
Nuke    Average     20    0.454      -47%      0.459      -7%       0.001       2%
Nuke    Average     50    0.649      -51%      0.586     -34%       0.001       2%
Nuke    Average    100    0.790      -54%      0.612     -63%       0.002       1%
Nuke    Average    200    0.910      -57%      0.630     -80%       0.003       0%

In this paper, the considered model is a low-dimensional linear model. It is argued in [5] that such a model is well able to describe user preferences, which explains why this model has been popular for CF algorithms. A low-dimensional model X can be computed from all observed ratings (including real ratings and biased ratings) using the SVD-based algorithm in Section 2.1.3.² Recall that the SVD-based algorithm used here computes an X that maximizes the log-likelihood of all observed ratings. Therefore, the computed model X is actually the linear model that maximizes the sum of the degrees of belief in all observed ratings. After X is computed, we can compute the degree of belief in any observed rating and any user's rating profile. Recall that A_{i,j} is i.i.d. N(X_{i,j}, σ²). The degree of belief in a rating, log Pr(A_{i,j} | X), can be derived as follows:

    log Pr(A_{i,j} | X) = log Pr(A_{i,j} | X_{i,j}) = - (1 / (2σ²)) (A_{i,j} - X_{i,j})² + C,

Table 3: Effect of attacks on the item-based CF algorithm when a single target item is chosen in each experiment.
Intent  Attack    Bots  PredShift  ExpTop40
Push    Random      20    0.002       5%
Push    Random      50    0.001      -6%
Push    Random     100    0.024     -12%
Push    Random     200    0.037     -19%
Push    Average     20    0.333      14%
Push    Average     50    0.474      22%
Push    Average    100    0.562      35%
Push    Average    200    0.616      36%
Nuke    Random      20   -0.029      -9%
Nuke    Random      50   -0.030     -26%
Nuke    Random     100   -0.046     -26%
Nuke    Random     200   -0.076     -31%
Nuke    Average     20    0.577     -18%
Nuke    Average     50    0.748     -39%
Nuke    Average    100    0.805     -68%
Nuke    Average    200    0.842     -83%

3.1 A Probabilistic Approach using SVD

Assume that we have a rating matrix A (users-by-items), and denote by X a model that describes this rating matrix. We first measure how likely it is that a rating is real by defining the degree of belief in a rating:

Definition 1. The degree of belief in a rating A_{i,j} is defined as the log-likelihood of this rating given the model X, that is, log Pr(A_{i,j} | X).

We then measure how likely it is that a user's rating profile is real by defining the degree of belief in a user's rating profile:

Definition 2. The degree of belief in user i's rating profile is (1/h_i) Σ_{l=1}^{h_i} log Pr(A_{i,j_l} | X), where h_i is the number of items user i has rated and j_l is the index of the l-th rated item.

where C is a constant. Subsequently, the degree of belief in user i's rating profile follows as

    (1/h_i) Σ_{l=1}^{h_i} log Pr(A_{i,j_l} | X) = - (1 / (2σ²)) · (1/h_i) Σ_{l=1}^{h_i} (A_{i,j_l} - X_{i,j_l})² + C.

Note that (1/h_i) Σ_{l=1}^{h_i} (A_{i,j_l} - X_{i,j_l})² is a weighted average of the squared differences between entries in the i-th row of A and entries in the i-th row of X if rated entries have weight one and unrated entries have weight zero. For ease of expression, we define this term as the belief divergence D(A_i || X_i):

    D(A_i || X_i) := (1/h_i) Σ_{l=1}^{h_i} (A_{i,j_l} - X_{i,j_l})².

According to the previous equation, user i's rating profile has a higher degree of belief if its belief divergence is smaller,

² As noted previously, each rated entry in A is initially normalized.
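The belief divergence D(A_i || X_i) defined above is just the mean squared difference between a user's (normalized) ratings and the model's values, taken over that user's rated items. A sketch, with our convention that NaN marks unrated entries:

```python
import numpy as np

def belief_divergence(A, X):
    """Per-user belief divergence D(A_i || X_i).

    A: (users x items) rating matrix with np.nan for unrated entries.
    X: (users x items) low-rank model values.
    Returns a length-m vector, one divergence per user, averaging the
    squared differences over each user's rated entries only.
    """
    sq = (A - X) ** 2            # NaN wherever A is unrated
    return np.nanmean(sq, axis=1)
```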
Figure 1: Plots of the belief divergences D(A_i || X_i) for real rating profiles (left of the vertical line) and attack profiles (right of the line), under (a) a random push attack and (b) a random nuke attack. The value of mean(D(A_j || X_j)) + 2 · std(D(A_j || X_j)) is shown as a dashed line. Most attacks are easy to distinguish using this line.

and vice versa. Therefore, D(A_i || X_i) is used as a metric to tell whether a rating profile is real or manipulated. We consider user i's rating profile to be normal if

    D(A_i || X_i) ≤ mean(D(A_j || X_j)) + c · std(D(A_j || X_j)),

where the mean and the standard deviation are computed over all rating profiles. If the belief divergences D(A_i || X_i) of real rating profiles can be well described by a Normal distribution,³ the right side of the above inequality is the threshold for D(A_i || X_i) at a confidence level given by the Normal cumulative distribution function of the value c. We set c to 2 in this paper, which corresponds to the 97.7% confidence level of a Normal distribution.

In Figure 1 we illustrate the effectiveness of this approach on MovieLens when a random attack model is used. Each subfigure plots the belief divergence D(A_i || X_i) for each rating profile. The first 6040 rating profiles (on the left of the vertical line) are real rating profiles and the last 100 profiles (on the right) are inserted bots. All inserted bots push or nuke one target item and give random ratings (~ N(3.6, 1.1²)) to another 2000 items. This number is chosen because the maximum number of ratings given by a user in MovieLens is 2314. The threshold value for D(A_i || X_i) is shown graphically as a dashed line.
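The thresholding rule above can be sketched as follows (function and variable names are ours; c = 2 as in the paper):

```python
import numpy as np

def flag_attacks(divergences, c=2.0):
    """Flag profiles whose belief divergence exceeds mean + c * std.

    divergences: per-user D(A_i || X_i) values, computed over ALL profiles
    (real and injected). Returns a boolean mask, True = suspected attack.
    """
    d = np.asarray(divergences, dtype=float)
    threshold = d.mean() + c * d.std()
    return d > threshold
```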
Figure 1 shows how sharply the belief divergence is able to separate random attacks from real rating profiles. It also gives some insight into why (as we will show in the next subsection) our method yields such high detection rates combined with low false alarm rates. As seen in the plots, almost all inserted attack profiles result in values of D(Aj ||Xj ) greater than the threshold, while very few of the real rating profiles yield D(Aj ||Xj ) greater than that. validation is focused on answering two questions: (1) how well can this method detect attack bots? and (2) how does the context of attacks affect the performance of the method? Three metrics that are used are the detection rate, false alarm rate, and Receiver Operating Characteristic (ROC) curve. The detection rate is defined as the number of detected attack bots divided by the number of attack bots. The false alarm rate is defined as the number of normal profiles that are predicted as attacks divided by the number of normal profiles. Note that we assume that all the original 6040 profiles in MovieLens are normal profiles. The ROC curve attempts to measure the extent to which an information filtering system can successfully distinguish between signal and noise (attacks and normal profiles in this case). The ROC area, which measures the area under the ROC curve, is used in our experiments. We first show how well the SVD-based method detects random attack bots. We performed 2 × 4 experiments, in each of which the attack intent (push or nuke) and the number of bots (20, 50, 100, or 200) were varied. For each experiment, there were 20 trials corresponding to 20 target items (one target item in each trial); averages were computed over them for all metrics. Each inserted bot pushes or nukes one target item and also gives ratings to another randomly chosen 2000 items. Table 4 reports detailed results from our experiments. The ROC area is always larger than or equal to 0.9998. 
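For clarity, the detection rate and false alarm rate defined above can be computed from the detector's boolean output and ground-truth labels as follows (a sketch; names are ours):

```python
import numpy as np

def detection_rates(flagged, is_attack):
    """Compute (detection rate, false alarm rate).

    flagged:   boolean mask of profiles the detector flagged as attacks.
    is_attack: boolean ground truth, True for injected attack bots.
    detection rate  = detected bots / all bots
    false alarm rate = flagged normal profiles / all normal profiles
    """
    flagged = np.asarray(flagged)
    is_attack = np.asarray(is_attack)
    detection = flagged[is_attack].mean()
    false_alarm = flagged[~is_attack].mean()
    return detection, false_alarm
```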
The SVD-based method detects at least 90% of the attack bots with a false alarm rate smaller than 0.25%. The table confirms quantitatively that our method is very accurate and precise for random attacks. Note that both the detection rate and the false alarm rate drop slightly when the number of bots increases. This is because the mean and the standard deviation of D(A_i || X_i) increase (and as a result the threshold increases) when more bots are injected. However, the stable ROC area demonstrates that it is still possible to separate the injected bots from the normal profiles in this case.

3.2 Detection Experimental Results

We use the experimental design introduced in Section 2.2 to evaluate the proposed SVD-based detection method. Our

³ By the central limit theorem, if h_i is the same for all i, then D(A_i || X_i) is asymptotically Normal as h_i increases.

Table 4: Results of the SVD-based detection method in MovieLens when random attacks are injected. Detection can be done accurately.

Intent  Bots  ROC               Detection Rate    False Alarm Rate
Push      20  0.9999 ± 0.0001   99.25% ± 1.83%    0.22% ± 0.01%
Push      50  0.9998 ± 0.0002   97.80% ± 1.82%    0.14% ± 0.02%
Push     100  0.9999 ± 0.0001   96.95% ± 1.73%    0.07% ± 0.01%
Push     200  0.9999 ± 0.0000   94.20% ± 1.41%    0.02% ± 0.01%
Nuke      20  0.9999 ± 0.0001   99.25% ± 1.83%    0.22% ± 0.01%
Nuke      50  0.9998 ± 0.0002   98.00% ± 1.72%    0.15% ± 0.02%
Nuke     100  0.9999 ± 0.0000   97.75% ± 1.12%    0.06% ± 0.01%
Nuke     200  0.9999 ± 0.0000   93.55% ± 1.86%    0.02% ± 0.01%

Table 5: Results of random push attacks (with 100 bots) for the SVD-based detection method when the number of rated items in each bot varies. Detection is essentially perfect (ROC > 0.999) when bots rate a large number (≥ 500) of items.
#rated  ROC               Detection         False Alarm
20      0.9744 ± 0.0150    7.45% ± 4.72%    0.17% ± 0.02%
50      0.9941 ± 0.0030   46.35% ± 8.97%    0.10% ± 0.01%
100     0.9977 ± 0.0007   68.10% ± 6.19%    0.07% ± 0.01%
150     0.9986 ± 0.0005   74.95% ± 4.68%    0.06% ± 0.00%
500     0.9997 ± 0.0001   93.10% ± 3.63%    0.07% ± 0.00%
1000    0.9999 ± 0.0001   97.10% ± 1.92%    0.06% ± 0.01%
2000    0.9999 ± 0.0001   96.95% ± 1.73%    0.07% ± 0.01%
All     0.9995 ± 0.0002   87.75% ± 2.51%    0.06% ± 0.02%

Table 6: Results of random push attacks (with 100 bots) for the SVD-based detection method when the number of target items varies.

#targets  ROC               Detection         False Alarm
1         0.9999 ± 0.0001   96.95% ± 1.73%    0.07% ± 0.01%
2         0.9999 ± 0.0001   97.45% ± 1.32%    0.06% ± 0.01%
5         0.9999 ± 0.0000   97.75% ± 1.37%    0.07% ± 0.01%
10        0.9997 ± 0.0002   92.20% ± 4.50%    0.07% ± 0.01%
20        0.9987 ± 0.0002   65.15% ± 4.61%    0.08% ± 0.02%

Table 7: Results of random push attacks (with 100 bots) for the SVD-based detection method when the standard deviation of random ratings varies. Detection performance is nearly constant.

Std   ROC               Detection         False Alarm
0.1   0.9999 ± 0.0001   97.00% ± 1.81%    0.06% ± 0.01%
0.4   0.9999 ± 0.0000   97.30% ± 1.59%    0.06% ± 0.02%
0.7   0.9999 ± 0.0000   98.00% ± 1.26%    0.06% ± 0.02%
1.1   0.9999 ± 0.0001   96.95% ± 1.73%    0.07% ± 0.01%

To answer the second question, about how this method works when the attack setting changes, we first vary the number of rated items (filler items) in the attack bots. We use a random push attack model with 100 bots in which a single target item is attacked. Results are presented in Table 5, which shows that the ROC area is larger than 0.999 when the number of rated items is at least 500. It also shows that injected bots having fewer rated items are harder to detect. An intuitive explanation of this finding is that the chance of observing inconsistency among the ratings becomes smaller in that case. However, when the number of rated items is 50, our method still achieves a 46.35% detection rate with a 0.10% false alarm rate.
Note that the median and mean numbers of rated items in normal rating profiles are 96 and 165.6, respectively. These two observations verify an important property of our approach: it depends very little on the number of rated items in normal profiles or attack bots. We next change the number of targets in each bot to determine the effect on the performance of our method. The experimental setting is similar to the above one, except that the number of targets in one experiment varies from 1 to 20. Results are listed in Table 6, which shows that the detection rate drops when more items are targeted. This implies that the belief divergences D(A_i || X_i) of the injected bots generally become smaller as more items are targeted. We finally show the performance of the detection method when the standard deviation of the Gaussian distribution used for generating random ratings is changed. Table 7 shows that both the detection rate and the false alarm rate remain fairly constant when the standard deviation becomes smaller. Note that when a small standard deviation is used, almost all random ratings have value 3 or 4 on the MovieLens 5-point scale. In that case, our method can still separate random attacks from normal profiles very well.

While the proposed method is effective at detecting random attacks, we observe that it has much less power to detect average attacks. Table 8 lists the detection results when an average attack model is used with 100 injected bots in which the number of rated items varies. When the number of rated items is around 50, the method still has limited success at separating average attacks from normal profiles. However, when this number is larger, the SVD-based method becomes completely ineffective. We also observe that the detection performance for average attacks can be enhanced if a higher rank of the linear model is chosen.
These findings suggest that average attacks are more similar to normal rating patterns and that they exhibit more subtle inconsistency among their ratings in a low-dimensional linear projection. Although average attacks are harder to detect, there is a higher knowledge cost of mounting them relative to random attacks, because the attacker needs to know rating averages for items in the targeted system. To summarize, our results above demonstrate that for a broad range of settings in random attacks, the SVD-based method has high detection rates and low false alarm rates.

3.3 Discussion

An advantage of the proposed detection method is that it is directly built on the SVD-based CF algorithm. Therefore, it requires little extra cost (O(mn), where m and n are the number of users and items, respectively) to compute belief divergence if the SVD-based algorithm is already incorporated in a recommendation system.

Table 8: Results of average push attacks (with 100 bots) for the SVD-based detection method when the number of rated items in each bot varies. The method has little power to detect average attacks when the number of rated items in each bot is around 50, and it becomes completely ineffective when this number is larger.

#rated  ROC               Detection         False Alarm
20      0.9614 ± 0.0207    3.65% ± 2.72%    0.18% ± 0.01%
50      0.9735 ± 0.0078   14.10% ± 5.54%    0.17% ± 0.01%
100     0.9582 ± 0.0088    6.45% ± 2.58%    0.18% ± 0.01%
150     0.9324 ± 0.0121    1.85% ± 1.46%    0.20% ± 0.01%
500     0.6844 ± 0.0101    0                0.24% ± 0.01%
All     0.5141 ± 0.0054    0                0.22% ± 0.02%

0308229. Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views of the National Science Foundation. We thank the anonymous reviewers for their comments, which helped us improve the quality of the paper.

5. REFERENCES
The computational cost of the SVD-based CF algorithm is O(l · (m2 n + mn2 )) if a deterministic SVD computation method is used and is O(l · kmn log(mn)) if a Lanczos approximation method is used to compute the top k singular vectors [7], where l is the number of iterations in the EM procedure. Several fast SVD approximation methods have been applied to recommendation systems, and interested readers can refer to [3, 20]. Based on the performance of our detection method on random attacks, we suppose it might be effective for detecting other attack models in which random ratings are used, e.g., the bandwagon attack in [12]. If that is true, attackers will need to incur a higher cost for mounting successful attacks by either using item average ratings instead of random ones or injecting more bots in each of which many fewer random ratings are used. 4. CONCLUSIONS AND FUTURE WORK In this paper, we empirically show that an SVD-based algorithm [19] is much more resistant to random and average shilling attacks than two representative memory-based algorithms [8, 17]. Moreover, a detection method built on the SVD-based algorithm is demonstrated to be effective in detecting random attacks. Previous papers [5, 16, 19] have shown that the SVD-based algorithm has the same or better recommendation performance than memory-based algorithms. Our study suggests one more reason for recommendation system operators to prefer this algorithm. Since research on shilling attacks in recommendation systems is relatively new in the literature, there are a few research directions available for future work. One is to experiment with other model-based CF algorithms (e.g., [9] and [11]) and hybrid CF algorithms (e.g., [14] and [18]) to examine how effective shilling attacks are against them. A second is to explore whether there are specific attack models that have a larger likelihood of success in attacking the SVD-based algorithm and other model-based algorithms. 
A third possibility is to find new detection methods that can effectively detect average attacks. Acknowledgment This material is based in part upon work supported by the National Science Foundation under award number IDM [1] Y. Azar, A. Fiat, A. Karlin, F. McSherry, and J. Saia. Spectral analysis of data. In Proceedings of the 33rd ACM Symposium on Theory of Computing, pages 619­626, 2001. [2] D. Billsus and M. J. Pazzani. Learning collaborative information filters. In Proceedings of the 15th International Conference on Machine Learning, pages 46­54, 1998. [3] M. Brand. Fast online SVD revisions for lightweight recommender system. In Proceedings of the 3rd SIAM International Conference on Data Mining, 2003. [4] R. Burke, B. Mobasher, R. Bhaumik, and C. Williams. Segment-based injection attacks against collaborative filtering recommender systems. In Proceedings of the International Conference on Data Mining (ICDM), pages 577­580, 2005. [5] J. Canny. Collaborative filtering with privacy via factor analysis. In Proceedings of the 25th ACM SIGIR Conference, pages 45­57, 2002. [6] K. Goldberg, T. Roeder, D. Gupta, and C. Perkins. Eigentaste: A constant time collaborative filtering algorithm. Information Retrieval, 4(2):133­151, 2001. [7] G. Golub and C. V. Loan. Matrix Computations (3rd edition). Johns Hopkins University Press, 1996. [8] J. L. Herlocker, J. A. Konstan, A. Borchers, and J. Riedl. An algorithmic framework for performing collaborative filtering. In Proceedings of the 22nd ACM SIGIR Conference, pages 230­237, 1999. [9] T. Hofmann. Latent semantic models for collaborative filtering. ACM Transaction on Information Systems, 22(1):89­115, 2004. [10] S. K. Lam and J. Riedl. Shilling recommender systems for fun and profit. In Proceedings of the 13th WWW Conference, pages 393­402, 2004. [11] B. Marlin. Modeling user rating profiles for collaborative filtering. In Proceedings of the 17th Annual Conference on Neural Information Processing Systems, 2003. [12] B. 
Mobasher, R. Burke, R. Bhaumik, and C. Williams. Effective attack models for shilling item-based collaborative filtering systems. In Proceedings of the WebKDD Workshop, 2005. [13] M. O'Mahony, N. Hurley, N. Kushmerick, and G. Silvestre. Collaborative recommendation: A robust analysis. ACM Transactions on Internet Technology, 4(4):344­377, 2004. [14] D. M. Pennock, E. Horvitz, S. Lawrence, and C. L. Giles. Collaborative filtering by personality diagnosis: A hybrid memory and model-based approach. In Proceedings of the 16th Conference on Uncertainty in Artificial Intel ligence, pages 473­480, 2000. [15] P. Resnick, N. Iacovou, M. Suchak, P. Bergstorm, and J. Riedl. GroupLens: An Open Architecture for Collaborative Filtering of Netnews. In Proceedings of 523 ACM Conference on Computer Supported Cooperative Work, pages 175­186, 1994. [16] B. M. Sarwar, G. Karypis, J. A. Konstan, and J. Riedl. Application of dimensionality reduction in recommender systems--a case study. In ACM WebKDD Web Mining for E-Commerce Workshop, 2000. [17] B. M. Sarwar, G. Karypis, J. A. Konstan, and J. Riedl. Item-based collaborative filtering recommendation algorithms. In Proceedings of the 10th WWW Conference, pages 285­295, 2001. [18] A. I. Schein, A. Popescul, L. H. Ungar, and D. M. Pennock. Methods and metrics for cold-start recommendations. In Proceedings of the 25th ACM SIGIR Conference, pages 253­260, 2002. [19] N. Srebro and T. Jaakkola. Weighted low rank approximation. In Proceedings of the 20th International Conference on Machine Learning, pages 720­727, 2003. [20] S. Zhang, W. Wang, J. Ford, F. Makedon, and J. Pearlman. Using singular value decomposition approximation for collaborative filtering. In Proceedings of the 7th IEEE Conference on E-commerce, pages 257­264, 2005. 524