A Content-Driven Reputation System for the Wikipedia

B. Thomas Adler
Computer Science Dept., University of California, 1156 High St., Santa Cruz, CA 95064, USA

Luca de Alfaro
Computer Engineering Dept., University of California, 1156 High St., Santa Cruz, CA 95064, USA

This research was supported in part by the grant CCR0132780.

ABSTRACT

We present a content-driven reputation system for Wikipedia authors. In our system, authors gain reputation when the edits they perform to Wikipedia articles are preserved by subsequent authors, and they lose reputation when their edits are rolled back or undone in short order. Thus, author reputation is computed solely on the basis of content evolution; user-to-user comments or ratings are not used. The author reputation we compute could be used to flag new contributions from low-reputation authors, or it could be used to allow only authors with high reputation to contribute to controversial or critical pages. A reputation system for the Wikipedia could also provide an incentive for high-quality contributions. We have implemented the proposed system, and we have used it to analyze the entire Italian and French Wikipedias, consisting of a total of 691,551 pages and 5,587,523 revisions. Our results show that our notion of reputation has good predictive value: changes performed by low-reputation authors have a significantly larger than average probability of having poor quality, as judged by human observers, and of being later undone, as measured by our algorithms.

Categories and Subject Descriptors

H.5.3 [Information Interfaces and Presentation]: Group and Organization Interfaces--Computer-supported cooperative work, Web-based interaction; K.4.3 [Computers and Society]: Organizational Impacts--Computer-supported collaborative work

General Terms

Algorithms, experimentation, measurement

Keywords

Wikipedia, reputation, user-generated content

1. INTRODUCTION

The collaborative, web-based creation of bodies of knowledge and information is a phenomenon of rising importance. One of the most successful examples of collaborative content creation, as well as one of the oldest, is the Wikipedia (www.wikipedia.com). The Wikipedia is a set of encyclopedias, each in a different language, that cover a very wide swath of knowledge, from history to geography, from math to politics, from pop culture to archeology. Anyone can contribute to the Wikipedia: when an article is displayed, the reader can click on an "edit" button, and is thus able to modify the content of the article. An important feature of the Wikipedia is that, for each article, all versions are kept. Users can easily roll back an article to a previous version, undoing the contributions of other users. A fundamental insight behind wiki development is that, if well-intentioned and careful users outnumber ill-intentioned or careless users, then valuable content will predominate, as the undesired contributions are easily undone [4].

We propose to use the information present in article versions to compute a content-driven reputation for Wikipedia authors. Authors gain reputation when the edits they perform to Wikipedia articles are preserved by subsequent authors, and they lose reputation when the edits are undone in short order. The lifespan of an edit to a Wikipedia article is inferred from an analysis of the subsequent versions of the article. We call such reputation content-driven, since it is computed on the basis of how content evolves, rather than on the basis of user comments or ratings.

A notion of author reputation can serve several purposes in the Wikipedia.
Author reputation provides a guide to the value of fresh contributions, which have not yet been vetted by subsequent authors. The reputation of the authors who contributed and vetted the text can be used as a rough guide to the veracity, or trust, of the article [19]. Author reputation can also be used for author management. Highly controversial articles could be protected, preventing low-reputation authors from editing them. Alternatively, editors could be alerted when crucial or controversial articles are edited by low-reputation authors. Furthermore, reputation can constitute an incentive for authors to provide high-quality contributions.

1.1 A Content-Driven Reputation System

Most reputation systems are user-driven: they are based on users rating each other's contributions or behavior [16, 5]. A famous example is the rating system of Ebay, where buyers and sellers rate each other after performing transactions. In contrast, the reputation system we propose for the Wikipedia requires no user input: rather, authors are evaluated on the basis of how their contributions fare. Suppose that an author A contributes to a Wikipedia article by editing it. When another author B subsequently revises the same article, she may choose to preserve some of the edits performed by A. By preserving them, B provides her vote of confidence in these edits, and in author A. Our reputation system will increase the reputation of A in a manner that depends on the amount of preserved edits, as well as on the reputation of B. Specifically, our system measures the following quantities:

· Text life: How much of the text inserted by A is still present after B's edit.
· Edit life: How much of the reorganization (text reordering and deletions) performed by A is preserved after B's edit.

Author A receives a reputation increment proportional to the size of his contribution, to the text and edit life of the contribution, and to the amount of reputation of B. Author A receives the largest reputation increment when B preserves his contribution in its entirety; conversely, A's reputation is most negatively affected when B rolls back A's contribution. More subtly, the way we compute text and edit life ensures that if B modifies the contribution of A, building on it, then A still receives a reputation increment.

Of the two criteria we use, text life is the most natural. We introduced edit life to account for article reorganizations that do not involve the creation of new text. For instance, if A reorders the paragraphs of an article, little text may be inserted, but edit life enables us to give credit to A for this reorganization, if it is preserved by B.

A content-driven reputation system has an intrinsic objectivity advantage over user-driven reputation systems. In order to badmouth (damage the reputation of) author B, an author A cannot simply give a negative rating to a contribution by B.
Rather, to discredit B, A needs to undo some contribution of B, thus running the risk that, if subsequent authors restore B's contribution, it will be A's reputation, rather than B's, that suffers, as A's edit is reversed. Likewise, authors cannot simply praise each other's contributions to enhance their reputations: their contributions must actually withstand the test of time. Finally, the reputation is inferred uniformly and automatically from all edits; the automatic nature of the computation also ensures that the user experience of Wikipedia visitors and authors is minimally affected.

Admittedly, a content-driven reputation system can be less accurate than a user-driven one. Author contributions can be deleted for a variety of reasons, including thorough article rewrites. Our reputation system copes with this in two ways. First, as mentioned above, the way we assign reputation is able to distinguish between reverted and refined contributions. Authors of reverted contributions lose reputation, while authors of subsequently refined contributions do not, even though both kinds of contributions do not survive in the long term. Second, contributions that are appropriate, but that are subsequently rewritten, tend to last longer than inappropriate contributions, which are usually reverted in short order: appropriate contributions thus stand a good chance of receiving at least partial credit.

While it could arguably be interesting to explore combinations of user-generated and content-driven reputation, in this paper we focus on content-driven reputation alone. The computational nature of content-driven reputation enables us to evaluate its effectiveness directly on the Wikipedia: building a separate (and, inevitably, small-scale and low-traffic) test wiki to experiment with the ideas is not necessary. We will present extensive data on the accuracy and performance of our content-driven reputation system, by applying it to the revision history of the Italian and French Wikipedias. The data will show that content-driven reputation can be used to predict the quality of authors' contributions.

1.2 Prescriptive, Descriptive, and Predictive Reputation

Reputation is most commonly used for its prescriptive and descriptive value:

· Prescriptive value. Reputation systems specify the way in which users can gain reputation, and thus define, and prescribe, what constitutes "good behavior" on the users' part. Users are strongly encouraged to follow the prescribed behavior, lest their reputation -- and their ability to use the system -- suffer.
· Descriptive value. Reputation systems can be used to classify users, and their contributions, on the basis of their reputation, making it easier to spot high-quality users or contributions, and flagging contributions or users with low reputation.

Ebay's reputation system is an example of a system that has both prescriptive and descriptive value. Users are strongly encouraged to accumulate positive feedback: it has been documented that positive feedback increases the probability of closing a transaction and, in the case of sellers, is connected to higher selling prices for goods [12]. The original PageRank algorithm [15] constitutes a reputation system for web pages with descriptive intent.

Our reputation system for the Wikipedia has prescriptive and descriptive value. Reputation is built by contributing lasting content; by flagging author reputation, we encourage such contributions.
The reputation we compute also has descriptive value: Wikipedia visitors can use the author's reputation as a guide to the trustworthiness of freshly-contributed text.

In addition to prescriptive and descriptive values, we argue that a reputation system that is truly useful for the Wikipedia must also have predictive value: an author's reputation should be statistically related to the quality of the author's future contributions. To a Wikipedia visitor, it is not generally very useful to know the reputation of an author who made a long-past contribution to an article. If the contribution survived for a long time, its quality is essentially proven: all subsequent editors of the same page have already implicitly voted on the contribution by leaving it in place. Reputation in the Wikipedia is most useful as a guide to the value of fresh contributions, which have not yet been vetted by other authors. Reputation is most useful if it can predict how well these fresh contributions will fare: whether they are likely to be long-lasting, or whether they are likely to be reverted in short order. Furthermore, when reputation is used to grant or deny the ability to edit a page, it is its predictive value that matters: after all, why deny the ability to edit a page, unless there is some statistical reason to believe that there is an increased likelihood that the edits will need to be undone?
1.3 Summary of the Results

In order to measure the predictive value of the proposed reputation, we implemented our reputation system, and we used it to analyze all edits done to the Italian Wikipedia from its inception to October 31, 2005 (154,621 pages, 714,280 versions), and all edits done to the French Wikipedia from its inception to July 30, 2006 (536,930 pages, 4,837,243 versions). (These version counts reflect the fact that we consider only the last among consecutive versions by the same author, as explained later.) We then measured the statistical correlation between the author's reputation at the time an edit was done, and the subsequent lifespan of the edit. This is a valid statistical test, in the sense that the two variables of the study are computed in independent ways: the reputation of an author at the time of the edit is computed from the behavior of the author before the edit, while the lifespan of the edit is determined by events after the edit.

To summarize the results, we introduce the following informal terminology; we will make the terms fully precise in the following sections:

· Short-lived edit is an edit which is 80% undone in the next couple of revisions;
· Short-lived text is text that is almost immediately removed (only 20% of the text in a version survives to the next version);
· Low-reputation author is an author whose reputation falls in the bottom 20% of the reputation scale.

When measuring the quantity of text or edits, our unit of measurement is a word: this provides a uniform unit of measurement, and ensures that splitting changes into two or more revisions does not affect the results. Anonymous authors are the largest source of short-lived contributions; as we cannot distinguish their identity, we keep their reputation fixed to the low value assigned to new authors. We first present the results excluding anonymous authors from the evaluation. This enables us to measure how well our notion of reputation predicts the quality of contributions by authors with trackable identity. Our results indicate that the fact that an author's reputation is low at the time a contribution is made is statistically correlated with the contribution being short-lived. For the French Wikipedia, the results for edits are as follows:

· While 5.7% of all edits are short-lived, 23.9% of the edits by low-reputation authors are short-lived. Thus, edits by low-reputation authors are 4.2 times more likely than average to be short-lived.
· 32.2% of short-lived edits are performed by low-reputation authors.

For text, the situation is as follows:

· While 1.3% of all text is short-lived, 5.8% of text contributed by low-reputation authors is short-lived. Thus, text by low-reputation authors is 4.5 times more likely than average to be short-lived.
· 37.8% of short-lived text is contributed by low-reputation authors.

Using search terminology, these results can be restated as follows:

· The recall provided by low reputation is 32.2% for short-lived edits and 37.8% for short-lived text.
· The precision provided by low reputation is 23.9% for short-lived edits and 5.8% for short-lived text.

If we consider also anonymous authors, the results improve markedly:

· The recall provided by low reputation, considering also anonymous users, is 82.9% for short-lived edits and 90.4% for short-lived text.
· The precision provided by low reputation, considering also anonymous users, is 48.9% for short-lived edits and 19.0% for short-lived text.

While we use search terminology to report these results, we stress that these results are about the reputation's ability to predict the longevity of future contributions by an author, rather than retrieve past contributions.

The above results may underestimate the recall of our content-driven reputation. We asked a group of volunteers to decide, of the short-lived contributions to the Italian Wikipedia, which ones were of poor quality. The results indicated that short-lived contributions by low-reputation authors were markedly more likely to be of poor quality, compared to similarly short-lived contributions by high-reputation authors. This allowed us to calculate that, excluding anonymous authors, the recall for bad-quality, short-lived edits is 49%, and the recall for bad-quality, short-lived text is 79%.

We are unsure of the extent to which prediction precision can be increased. Many authors with low reputation are good, but novice, contributors, who have not yet had time to accumulate reputation. Indeed, an unfortunate effect of the ease with which users can register anew to the Wikipedia is that we cannot trust novices any more than confirmed bad contributors -- if we trusted novices more, bad contributors would simply re-register. Furthermore, even authors who contribute short-lived text and edits, and who therefore have low reputation, do not do so consistently, interjecting good contributions among the bad ones.

A basic form of reputation, currently used to screen contributors to controversial pages, is the length of time for which users have been registered to the Wikipedia. We did not have access to this temporal information in our study (Wikipedia makes only a limited amount of user information available for download). As a proxy, we considered the number of edits performed by a user. This is a form of reputation that has no prescriptive value: all it does is encourage users to split their contributions into multiple, small ones or, even worse, to perform a multitude of gratuitous edits. Indeed, if edit count were used as reputation, it would most likely induce undesirable behaviors by the Wikipedia users. We will show that the predictive value of such a reputation is also inferior to the one we compute, although the difference is not large, and we will discuss why we believe this is the case.
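The "4.2 times" and "4.5 times" figures quoted above are simply the ratios of the rate of short-lived contributions among low-reputation authors to the overall rate (the boost defined formally in Section 4). As a quick check of the arithmetic, not an additional result:

\[
\frac{\Pr(\text{short-lived edit} \mid \text{low reputation})}{\Pr(\text{short-lived edit})} = \frac{23.9\%}{5.7\%} \approx 4.2,
\qquad
\frac{\Pr(\text{short-lived text} \mid \text{low reputation})}{\Pr(\text{short-lived text})} = \frac{5.8\%}{1.3\%} \approx 4.5 .
\]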
1.4 Related Work

The work most closely related to ours is [19], where the revision history of a Wikipedia article is used to compute a trust value for the article. Dynamic Bayesian networks are used to model the evolution of the trust level over the versions. At each revision, the inputs to the network are a priori models of trust of authors (determined by their Wikipedia ranks), and the amount of added and deleted text. The paper shows that this approach can be used to predict the quality of an article; for instance, it can be used to predict when an article in a test set can be used as a featured article. Differently from the current work, author trustworthiness is taken as input; we compute author reputation as output. Several approaches for computing text trust are outlined in [13]. A simpler approach to text trust, based solely on text age, is advocated in [3]. Document trust and author reputation have different applications. Document trust provides a measure of confidence in the accuracy of the document. Author reputation is most useful as an indicator of quality for fresh text, and can be used to manage author activity, for instance, preventing low-reputation authors from editing some pages.

Reputation systems in e-commerce and social networks have been extensively studied (see, e.g., [10, 16, 5, 9]); the reputation in those systems is generally user-driven, rather than content-driven as in our case. Also related is work on trust in social networks (see, e.g., [7, 6]). The history flow of text contributed by Wikipedia authors has been studied with flow visualization methods in [18]; the results have been used to analyze a number of interesting patterns in the content evolution of Wikipedia articles. Work on mining software revision logs (see, e.g., [11]) is similar in its emphasis on in-depth analysis of revision logs; the aim, however, is to find revision patterns and indicators that point to software defects, rather than to develop a notion of author reputation.
2. CONTENT-DRIVEN REPUTATION

Reputation systems can be classified into two broad categories: chronological, where the reputation is computed from the chronological sequence of ratings a user receives, and fixpoint, where the reputation is computed via a fixpoint calculation performed over the graph of feedbacks. The Ebay reputation system is an example of a chronological system, while the PageRank and HITS algorithms are examples of fixpoint algorithms [15, 10]. We chose to follow the chronological approach to develop our content-driven reputation for the Wikipedia. The chief advantage of a chronological approach is that it is computationally lightweight. When an author revises a Wikipedia article, we can efficiently compute the feedback to authors of previous versions of the same article, and we can modify their reputation in real time, with little impact on server response time.

2.1 Notation

The following notation will be used throughout the paper. We assume that we have n > 0 versions v_0, v_1, v_2, ..., v_n of a document; version v_0 is empty, and version v_i, for 1 <= i <= n, is obtained by author a_i performing a revision r_i: v_{i-1} -> v_i. We refer to the change set corresponding to r_i: v_{i-1} -> v_i as the edit performed at r_i: the edit consists of the text insertions, deletions, displacements, and replacements that led from v_{i-1} to v_i. When editing a versioned document, authors commonly save several versions in a short time frame. To ensure that such behavior does not affect reputations, we filter the versions, keeping only the last of consecutive versions by the same author; we thus assume that a_i != a_{i+1} for 1 <= i < n (a small illustrative sketch of this filtering step is given just before Section 2.3). Every version v_i, for 0 <= i <= n, consists of a sequence [w_1^i, ..., w_{m_i}^i] of words, where m_i is the number of words of v_i; we have m_0 = 0. A word is a whitespace-delimited sequence of characters in the wiki markup language: we work at the level of such markup language, rather than at the level of the HTML produced by the wiki engine. Given a series of versions v_0, v_1, ..., v_n of a text document, we assume that we can compute the following quantities:

· txt(i, j), for 0 < i <= j <= n, is the amount of text (measured in number of words) that is introduced by r_i in v_i, and that is still present (and due to the same author a_i) in v_j. In particular, txt(i, i) is the amount of new text added by r_i.
· d(v_i, v_j), for 0 < i < j <= n, is the edit distance between v_i and v_j, and measures how much change (word additions, deletions, replacements, displacements, etc.) there has been in going from v_i to v_j.

We will describe in Section 3 how these quantities can be computed from the series of versions v_0, ..., v_n.

2.2 Content-Driven Reputation in a Versioned Document

We propose the following method for computing content-driven reputation in a versioned document. Consider a revision r_i: v_{i-1} -> v_i, performed by author a_i, for some 0 < i <= n. Each of the subsequent authors a_{i+1}, a_{i+2}, ... can either retain or remove the edits performed by a_i at r_i. Thus, we consider later versions v_j of the document, for i < j <= n, and we increase, or decrease, the reputation of a_i according to how much of the change performed at r_i has survived until v_j. The greater the reputation of the author a_j of v_j, the greater the change in the reputation of a_i. This ensures that high-reputation authors risk little of their reputation when fighting revision wars against low-reputation authors, such as anonymous authors. If this were not the case, high-reputation authors would be wary of undoing damage done by low-reputation authors, including spammers, for fear of reprisal.

We use two criteria for awarding reputation: the survival of text, and the survival of edits. Text survival is the most natural of these two criteria. Adding new text to a Wikipedia article is a fundamental way of contributing -- it is how knowledge is added to the Wikipedia. However, text survival alone fails to capture many ways in which authors contribute to the Wikipedia. If a user rearranges the content of an article without introducing new text, the contribution cannot be captured by text survival. Similarly, reverting an act of vandalism does not result in the introduction of new text (our algorithms distinguish between new text and the re-introduction of previously deleted text). Edit survival captures how long the rearrangements performed by an author last in the history of a Wikipedia article, and captures all of the above ways of contributing.
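To make the notation of Section 2.1 concrete, here is a minimal sketch (ours, not the authors' code) of the version-filtering step: consecutive versions saved by the same author are collapsed, keeping only the last one, so that a_i != a_{i+1} in the filtered sequence. The Version record and its field names are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Version:
    author: str        # author a_i of revision r_i : v_{i-1} -> v_i
    words: List[str]   # the version text, as whitespace-delimited words of wiki markup

def filter_versions(versions: List[Version]) -> List[Version]:
    """Keep only the last of consecutive versions by the same author (Section 2.1)."""
    filtered: List[Version] = []
    for v in versions:
        if filtered and filtered[-1].author == v.author:
            filtered[-1] = v      # a later save by the same author supersedes the earlier one
        else:
            filtered.append(v)
    return filtered
```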
2.3 Accounting for Text Survival

If text introduced by r_i is still present in v_j, for 0 < i < j <= n, this indicates that author a_j, who performed the revision r_j: v_{j-1} -> v_j, agrees that the text is valuable. To reflect this, we increase the reputation of a_i in a manner that is proportional both to the amount of residual text, and to the reputation of a_j. Thus, we propose the following rule.

Rule 1 (reputation update due to text survival). When the revision r_j occurs, for all 0 < i < j such that j - i <= 10 and a_j != a_i, we add to the reputation of a_i the amount

    c_scale · c_text · [txt(i, j) / txt(i, i)] · (txt(i, i))^{c_len} · log(1 + R(a_j, r_j)),

where c_scale > 0, c_text in [0, 1], and c_len in [0, 1] are parameters, and where R(a_j, r_j) is the reputation of a_j at the time r_j is performed.

In the rule, txt(i, j)/txt(i, i) is the fraction of text introduced at version v_i that is still present in version v_j; this is a measure of the "quality" of r_i. The quantity log(1 + R(a_j, r_j)) is the "weight" of the reputation of a_j, that is, how much the reputation of a_j lends credibility to the judgements of a_j. In our reputation system, the reputations of many regular contributors soar to very high values; using a logarithmic weight for reputation ensures that the feedback coming from new authors is not completely overridden by the feedback coming from such regular contributors. The parameters c_scale, c_text and c_len will be determined experimentally via an optimization process, as described later. The parameter c_len in [0, 1] is an exponent that specifies how to take into account the length of the original contribution: if c_len = 1, then the increment is proportional to the length of the original contribution; if c_len = 0, then the increment does not depend on the length of the original contribution. The parameter c_scale specifies how much the reputation should vary in response to an individual feedback. The parameter c_text specifies how much the feedback should depend on residual text (Rule 1) versus residual edit (Rule 2, presented later). To give feedback on a revision, the rule considers at most 10 successive versions. This ensures that contributors to early versions of articles do not accumulate disproportionate amounts of reputation. A rule using exponential decay, rather than a sharp cutoff at 10, would probably have been preferable, at a slightly larger computational cost.
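A direct transcription of Rule 1 into code might look as follows. This is a sketch under the stated parameter ranges; txt(i, i), txt(i, j) and the judging author's reputation are passed in as assumed inputs, and the caller is responsible for the conditions j - i <= 10 and a_j != a_i.

```python
import math

def rule1_increment(txt_ii: float, txt_ij: float, rep_aj: float,
                    c_scale: float, c_text: float, c_len: float) -> float:
    """Reputation increment for author a_i when revision r_j finds txt_ij of the
    txt_ii words added at r_i still present (Rule 1)."""
    if txt_ii == 0:
        return 0.0
    return (c_scale * c_text
            * (txt_ij / txt_ii)          # fraction of the added text that survived
            * (txt_ii ** c_len)          # weight by the size of the original contribution
            * math.log(1.0 + rep_aj))    # weight by the judging author's reputation
```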
2.4 Accounting for Edit Survival

To judge how much of a revision r_i: v_{i-1} -> v_i is preserved in a later version v_j, for 0 < i < j <= n, we reason as follows. We wish to rate r_i higher if r_i made the article more similar to v_j. In particular, we wish to credit a_i in proportion to how much of the change r_i is directed towards v_j. This suggests using the formula:

    ELong(i, j) = [d(v_{i-1}, v_j) - d(v_i, v_j)] / d(v_{i-1}, v_i).    (1)

If d satisfies the triangular inequality (as our edit distance does, to a high degree of accuracy), then ELong(i, j) lies in [-1, 1]. For two consecutive edits r_i, r_{i+1}, if r_i is completely undone in r_{i+1} (as is common when r_i consists of the introduction of spam or inappropriate material), then ELong(i, i+1) = -1; if r_{i+1} entirely preserves r_i, then ELong(i, i+1) = +1. For more distant edits, ELong(i, j) is a measure of how much of the edit performed during r_i is undone (value -1) or preserved (value +1) before r_j. Note that ELong(i, j) < 0, i.e., j votes to lower the reputation of i, only when d(v_{i-1}, v_j) < d(v_i, v_j), that is, when v_j is closer to the version v_{i-1} preceding a_i's contribution than to v_i. Thus, the revision performed by a_i needs to have been undone at least in part by a_{i+1}, a_{i+2}, ..., a_j, for a_i's reputation to suffer. This is one of the two mechanisms we have to ensure that authors whose contributions are rewritten, rather than rolled back, still receive reputation in return for their efforts. The second mechanism is described in Section 3.2. We use the following rule for updating reputations.

Rule 2 (reputation update due to edit survival). When the revision r_j occurs, for all 0 < i < j such that j - i <= 3, we add to the reputation of a_i an amount q determined as follows. If a_j = a_i or d(v_{i-1}, v_i) = 0, then q = 0; otherwise, q is determined by the following algorithm:

    q := [c_slack · d(v_{i-1}, v_j) - d(v_i, v_j)] / d(v_{i-1}, v_i)
    if q < 0 then q := q · c_punish endif
    q := q · c_scale · (1 - c_text) · (d(v_{i-1}, v_i))^{c_len} · log(1 + R(a_j, r_j))

In the algorithm, c_punish >= 1, c_slack >= 1, c_scale > 0, c_text in [0, 1], and c_len in [0, 1] are parameters, and R(a_j, r_j) is the reputation of a_j at the time r_j is performed.

The rule adopts a modified version of (1). The parameter c_slack > 1 is used to spare a_i from punishment when r_i: v_{i-1} -> v_i is only slightly counterproductive. On the other hand, when punishment is incurred, its magnitude is magnified by the amount c_punish, raising the reputation cost of edits that are later undone. The parameters c_slack and c_punish, as well as c_scale, c_text and c_len, will be determined via an optimization process. To assign edit feedback, we have chosen to consider only the 3 previous versions of an article. This approach proved adequate for analyzing an already-existing wiki, in which authors could not modify their behavior using knowledge of this threshold. If the proposed content-driven reputation were to be used on a live wiki, it would be advisable to replace this hard threshold by a scheme in which the feedback of v_j on r_i is weighed by a gradually decreasing function of j - i (such as exp(c · (i - j)) for some c > 0).
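The following sketch transcribes equation (1) and Rule 2. One caveat: the grouping of c_slack in the first assignment follows our reading of the (imperfectly recovered) formula -- it yields a punishment only when d(v_i, v_j) exceeds c_slack · d(v_{i-1}, v_j) -- which matches the stated purpose of c_slack but should be treated as an assumption.

```python
import math

def elong(d_prev_j: float, d_i_j: float, d_prev_i: float) -> float:
    """ELong(i, j) of equation (1): +1 if r_i is entirely preserved before v_j, -1 if
    entirely undone.  Arguments are d(v_{i-1}, v_j), d(v_i, v_j), d(v_{i-1}, v_i)."""
    return (d_prev_j - d_i_j) / d_prev_i

def rule2_increment(d_prev_j, d_i_j, d_prev_i, rep_aj,
                    c_slack, c_punish, c_scale, c_text, c_len) -> float:
    """Reputation increment q for author a_i from revision r_j (j - i <= 3, a_j != a_i),
    following the algorithm of Rule 2."""
    if d_prev_i == 0:
        return 0.0
    q = (c_slack * d_prev_j - d_i_j) / d_prev_i   # slack spares mildly counterproductive edits
    if q < 0:
        q *= c_punish                              # edits that were undone are punished harder
    return q * c_scale * (1.0 - c_text) * (d_prev_i ** c_len) * math.log(1.0 + rep_aj)
```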
2.5 Computing Content-Driven Reputation

We compute the reputation for Wikipedia authors as follows. We examine all revisions in chronological order, thus simulating the same order in which they were submitted to the Wikipedia servers. We initialize the reputations of all authors to the value 0.1; the reputation of anonymous authors is fixed at 0.1. We choose a positive initial value to ensure that the weight log(1 + r) of an initial reputation r = 0.1 is non-zero, priming the process of reputation computation. This choice of initial value is not particularly critical (the parameter c_scale may need to be adjusted for optimal performance if this initial value is changed). As the revisions are processed, we use Rules 1 and 2 to update the reputations of authors in the system. When updating reputations, we ensure that they never become negative, and that they never grow beyond a bound c_maxrep > 0. The bound c_maxrep prevents frequent contributors from accumulating unbounded amounts of reputation, and becoming essentially immune to negative feedback. The value of c_maxrep will once again be determined via optimization techniques.

Wikipedia allows users to register, and create an author identity, whenever they wish. As a consequence, we need to make the initial reputation of new authors very low, close to the minimum possible (in our case, 0). If we made the initial reputation of new authors any higher, then authors, after committing revisions that damage their reputation, would simply re-register as new users to gain the higher value. An unfortunate side-effect of allowing people to obtain new identities at will is that we cannot presume that people are innocent until proven otherwise: we have to assign to newcomers the same reputation as proven offenders. This is a contributing factor to our reputation having low precision: many authors who have low reputation still perform very good quality revisions, as they are simply new authors, rather than proven offenders.
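A skeleton of the chronological computation described in Section 2.5, reusing the rule1_increment and rule2_increment sketches above, might look as follows. The representation of authors (None marking anonymity), the dictionary of parameters, and the callables txt(i, j) and dist(i, j) = d(v_i, v_j) are all assumptions of this sketch, not part of the paper.

```python
INITIAL_REP = 0.1  # starting reputation; anonymous authors stay fixed at this value

def process_page(authors, txt, dist, rep, p):
    """Per-page reputation updates, processed in chronological order (Section 2.5).

    authors[i] is a_i for 1 <= i <= n (authors[0] is unused; None marks an anonymous
    author); txt(i, j) and dist(i, j) are the quantities of Section 2.1, passed in as
    callables; rep maps author -> reputation and is updated in place; p holds the six
    parameters c_scale, c_text, c_len, c_slack, c_punish, c_maxrep.
    """
    n = len(authors) - 1
    for j in range(2, n + 1):                       # each revision r_j gives feedback ...
        a_j = authors[j]
        r_aj = INITIAL_REP if a_j is None else rep.get(a_j, INITIAL_REP)
        for i in range(max(1, j - 10), j):          # ... to authors of the previous 10 revisions
            a_i = authors[i]
            if a_i is None or a_i == a_j:
                continue
            delta = rule1_increment(txt(i, i), txt(i, j), r_aj,
                                    p["c_scale"], p["c_text"], p["c_len"])
            if j - i <= 3 and dist(i - 1, i) > 0:   # Rule 2 looks back at most 3 versions
                delta += rule2_increment(dist(i - 1, j), dist(i, j), dist(i - 1, i), r_aj,
                                         p["c_slack"], p["c_punish"],
                                         p["c_scale"], p["c_text"], p["c_len"])
            # reputations are clamped to [0, c_maxrep]
            rep[a_i] = min(max(rep.get(a_i, INITIAL_REP) + delta, 0.0), p["c_maxrep"])
```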
3. TEXT LONGEVITY AND EDIT DISTANCE IN VERSIONED DOCUMENTS

In this section, we propose methods for tracking text authorship, and computing edit distances, in versioned documents. We developed our algorithms starting from standard text-difference algorithms, such as those of [17, 14, 1]. However, we adapted the algorithms in several places, to enable them to better cope with the versioned nature of the Wikipedia, and with the kind of edits that authors perform.

3.1 Tracking Text Authorship

Given a sequence of versions v_0, v_1, ..., v_n, we describe an algorithm for computing txt(i, j) for 0 < i <= j <= n. Superficially, it might seem that to compute txt(i, j), we need to consider only three versions of the document: v_{i-1}, v_i, and v_j. From v_{i-1} and v_i we can derive the text that is added at revision r_i: v_{i-1} -> v_i; we can then check how much of it survives until v_j. This approach, however, is not appropriate for a versioned document like a wiki page, where authors are allowed to inspect and restore text from any previous version of a document. For example, consider the case in which revision r_{i-1}: v_{i-2} -> v_{i-1} is the work of a spammer, who erases entirely the text of v_{i-2}, and replaces it with spurious material; such spam insertions are a common occurrence in open wikis. When author a_i views the page, she realizes that it has been damaged, and she reverts it to the previous version, so that v_i = v_{i-2}. If we derived the text added by r_i by considering v_{i-1} and v_i only, it would appear to us that a_i is the original author of all the text in v_i, but this is clearly not the case: she simply restored pre-existing text.

To compute the text added at a revision r_i: v_{i-1} -> v_i, we keep track both of the text that is in v_{i-1}, and of the text that used to be present in previous versions, and that has subsequently been deleted. Our algorithm proceeds as follows. We call a chunk a list c = [(w_1, q_1), ..., (w_k, q_k)], where for 1 <= j <= k, w_j is a word, and q_j in {1, ..., n} is a version number. A chunk represents a list of contiguous words, each labeled with the version where it originated. The algorithm computes, for each version v_i, 1 <= i <= n, its chunk list C_i = [c_0^i, c_1^i, ..., c_{k_i}^i]. The chunk c_0^i is the live chunk, and it consists of the text of v_i, with each word labeled with the version where the word was introduced; thus, if v_i = [w_1^i, ..., w_{m_i}^i], we have c_0^i = [(w_1^i, q_1), ..., (w_{m_i}^i, q_{m_i})], for some q_1, ..., q_{m_i}. The chunks c_1^i, ..., c_{k_i}^i are dead chunks, and they represent contiguous portions of text that used to be present in some version of the document prior to i. Given the chunk list C_i for document v_i, we can compute txt(j, i) as the number of words in the live chunk that are labeled with version j:

    txt(j, i) = |{ (w, q) in c_0^i : q = j }|,    for 1 <= j <= i <= n.

To compute C_i for all versions i of a document, we propose an algorithm that proceeds as follows. For the initial version, we let C_1 = [[(w_1^1, 1), (w_2^1, 1), ..., (w_{m_1}^1, 1)]]. For 1 <= i < n, the algorithm computes C_{i+1} from C_i = [c_0^i, c_1^i, ..., c_{k_i}^i] and v_{i+1}. To compute the live chunk c_0^{i+1}, we match contiguous portions of text in v_{i+1} with contiguous text in any of the chunks in C_i; the matching words in v_{i+1} are labeled with the version index that labels them in C_i, and represent words that were introduced prior to version v_{i+1}. Any words of v_{i+1} that cannot be thus matched are considered new in v_{i+1}, and are labeled with version index i + 1. The dead chunks c_1^{i+1}, ..., c_l^{i+1} of C_{i+1} are then obtained as the portions of the chunks in C_i that were not matched by any text in v_{i+1}. We allow the same text in C_i to be matched multiple times in v_{i+1}: if a contributor copies multiple times text present in v_i or in prior versions in order to obtain v_{i+1}, the replicated text should not be counted as new in v_{i+1}. Considering replicated text as new would open the door to a duplication attack, whereby an attacker duplicates text in a revision r_i, and then removes the original text in a revision r_k: v_{k-1} -> v_k with k > i. From version v_k onwards, the text would be attributed to the attacker rather than to the original author.

Matching v_{i+1} with C_i is a matter of finding matches between text strings, and several algorithms have been presented in the literature to accomplish this in an efficient manner (see, e.g., [8, 17, 14, 1]). We experimented extensively, and the algorithm that gave the best combination of efficiency and accuracy was a variation of a standard greedy algorithm. In standard greedy algorithms, such as [14, 1], longest matches are determined first. In our algorithm, we define a notion of match quality, and we determine first the matches of highest quality. To define match quality, we let m_{i+1} be the length of v_{i+1}, and we let m' be the length of the chunk of C_i where the match is found (all lengths and indices are measured in number of words). Let l be the length of the match, and assume that the match begins at word k <= m' in the chunk, and at word k_{i+1} <= m_{i+1} in v_{i+1}. We define match quality as follows:

· If the match occurs between v_{i+1} and the live chunk, then the quality is:

    l / min(m_{i+1}, m') - 0.3 · | k_{i+1}/m_{i+1} - k/m' |.

· If the match occurs between v_{i+1} and a dead chunk, then the quality is 0 if l < 4, and is l / min(m_{i+1}, m') - 0.4 otherwise.

Thus, the longer the match, the higher its quality. If the match is with the live chunk, a match has higher quality if the text appears in the same relative position in v_{i+1} and in v_i. Matches with dead chunks have somewhat lower quality than matches with the live chunk: this corresponds to the fact that, if some text can be traced both to the previous version (the live chunk) and to some text that was previously deleted, the most likely match is with the text of the previous version. Moreover, matches with dead chunks have to be at least of length 4: this avoids misclassifying common words in new text as re-introductions of previously-deleted text. The coefficients in the above definition of quality have been determined experimentally, comparing human judgements of authorship to the algorithmically computed ones for many pages of the Italian Wikipedia.
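The quality formula above had to be reconstructed from a damaged extraction; the sketch below encodes our reading of it (in particular, the live-chunk term is taken to penalize the difference in relative position, as the prose explains). Parameter names such as k_new / k_chunk are illustrative.

```python
def match_quality(l: int, k_new: int, k_chunk: int, m_new: int, m_chunk: int,
                  live: bool) -> float:
    """Quality of a candidate match of length l (in words) between v_{i+1} and a chunk of
    C_i (Section 3.1).  k_new / k_chunk are the match start positions in v_{i+1} and in the
    chunk; m_new / m_chunk are their total lengths.  The constants 0.3, 0.4 and the minimum
    dead-chunk match length 4 are those reported in the text."""
    base = l / min(m_new, m_chunk)
    if live:
        # matches that keep their relative position in the page are preferred
        return base - 0.3 * abs(k_new / m_new - k_chunk / m_chunk)
    if l < 4:
        return 0.0        # too short to count as a re-introduction of deleted text
    return base - 0.4     # dead-chunk matches rank below comparable live-chunk matches
```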
The text survival algorithm we developed is efficient: the main bottleneck, when computing text authorship, is not the running time of the algorithm, but rather the time required to retrieve all versions of a page from the MySQL database in which Wikipedia pages are stored. (The measurement was done on a PC with an AMD Athlon 64 3000+ CPU, two hard drives configured in RAID 1 (mirroring), and 1 GB of memory.)

3.2 Computing Edit Distances

For two versions v, v', the edit distance d(v, v') is a measure of how many word insertions, deletions, replacements, and displacements are required to change v into v' [17]. In order to compute the edit distance between two versions v and v', we use the same greedy algorithm for text matching that we used for text survival, except that each portion of text in v (resp. v') can be matched at most once with a portion of text in v' (resp. v). Thus, text duplication is captured as an edit. The output of the greedy matching is modified, as is standard in measurements of edit distance, so that it outputs a list L of elements that describe how v' can be obtained from v:

· I(j, k): k words are inserted at position j;
· D(j, k): k words are deleted at position j;
· M(j, h, k): k words are moved from position j in v to position h in v'.

We compute the total amount I_tot of inserted text by summing, for each I(j, k) in L, the length k; similarly, we obtain the total amount D_tot of deleted text by summing, for each D(j, k) in L, the length k. We take into account insertions and deletions via the formula max(I_tot, D_tot) - (1/2) min(I_tot, D_tot): thus, every word that is inserted or removed contributes 1 to the distance, and every word that is replaced contributes 1/2. The motivation for this treatment of replacements is as follows. Suppose that author a_i edits v_{i-1}, adding a new paragraph consisting of k words and obtaining v_i, and suppose that author a_{i+1} then completely rewrites the paragraph (keeping it of equal length), obtaining v_{i+1}. If we accounted for insertions and deletions via I_tot + D_tot, we would have d(v_{i+1}, v_{i-1}) = k and d(v_{i+1}, v_i) = 2k: consequently, according to (1) and Rule 2, a_i's reputation would suffer. With our formula, we have instead d(v_{i+1}, v_{i-1}) = k and d(v_{i+1}, v_i) = k/2, so that a_i's reputation increases. This is the second mechanism we have for ensuring that authors whose contributions are rewritten, rather than rolled back, still receive credit for their effort.

We account for text moves between versions v and v' as follows.
Let l, l' be the lengths (in words) of v and v', respectively. Each time a block of text of length k_1 exchanges position with a block of text of length k_2, we count this as k_1 · k_2 / max(l, l'). Thus, a word that moves across k other words contributes k / max(l, l') to the distance: the contribution approaches 1 as the word is moved across the whole document. The total contribution M_tot of all moves can be computed by adding k · k' / max(l, l') for all pairs of moves M(j, h, k) in L and M(j', h', k') in L such that j < j' and h > h' (this ensures that every crossing is counted once). We finally define:

    d(v, v') = max(I_tot, D_tot) - (1/2) min(I_tot, D_tot) + M_tot.

Due to the nature of the greedy algorithms used for text matching, and of the definitions above, our edit distance is not guaranteed to satisfy the triangular inequality. However, we found experimentally that the proposed edit distance, on Wikipedia pages, satisfies the triangular inequality to within approximately one unit (one word) for well over 99% of triples of versions of the same page.
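Assuming the edit script L has already been produced by the greedy matcher (not shown here), the distance of Section 3.2 can be assembled as in the sketch below; the tuple encoding of the script elements is an assumption of ours.

```python
from typing import List, Tuple

# Edit script elements, as in Section 3.2: ("I", j, k), ("D", j, k), ("M", j, h, k).
Op = Tuple

def edit_distance(ops: List[Op], len_v: int, len_w: int) -> float:
    """d(v, v') from an edit script: insertions/deletions contribute
    max(I_tot, D_tot) - 0.5 * min(I_tot, D_tot) (a replacement costs 1/2 per word),
    and each pair of crossing moves contributes k * k' / max(|v|, |v'|)."""
    i_tot = sum(op[2] for op in ops if op[0] == "I")
    d_tot = sum(op[2] for op in ops if op[0] == "D")
    moves = [op for op in ops if op[0] == "M"]
    m_tot = 0.0
    for a in range(len(moves)):
        _, j1, h1, k1 = moves[a]
        for b in range(a + 1, len(moves)):
            _, j2, h2, k2 = moves[b]
            if (j1 - j2) * (h1 - h2) < 0:          # the two blocks cross each other
                m_tot += k1 * k2 / max(len_v, len_w)
    return max(i_tot, d_tot) - 0.5 * min(i_tot, d_tot) + m_tot
```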
4. EVALUATION METRICS

We now develop quantitative measures of the ability of our content-driven reputation to predict the quality of future revisions. For a revision r_i: v_{i-1} -> v_i in a sequence v_0, v_1, ..., v_n of versions, let q_t(r_i) = txt(i, i) be the amount of new text introduced at r_i, and q_e(r_i) = d(v_{i-1}, v_i) be the amount of editing involved in r_i. We define edit and text longevity as follows:

· The edit longevity alpha_e(r_i) in [-1, 1] of r_i is the average of ELong(i, j) for i < j <= min(i + 3, n).
· The text longevity alpha_t(r_i) in [0, 1] of r_i is the solution of the following equation:

    sum_{j=i}^{n} txt(i, j) = txt(i, i) · sum_{j=i}^{n} (alpha_t(r_i))^{j-i}.    (2)

Thus, alpha_t(r_i) is the coefficient of exponential decay of the text introduced by r_i: on average, after k revisions, only a fraction (alpha_t(r_i))^k of the introduced text survives. As all quantities in (2) except alpha_t(r_i) are known, we can solve for alpha_t(r_i) using standard numerical methods. We also indicate by rep(r_i) the reputation of the author a_i of r_i at the time r_i was performed. We note that rep(r_i) is computed from the history of the Wikipedia before r_i, while alpha_e(r_i) and alpha_t(r_i) depend only on events after r_i. Moreover, alpha_e(r_i) and alpha_t(r_i) can be computed independently of reputations.

Let R be the set of all revisions in the Wikipedia (of all articles). We view revisions as a probabilistic process, with R as the set of outcomes. We associate with each revision a probability mass (a weight) proportional to the number of words it affects. This ensures that the metrics are not affected if revisions by the same author are combined, or split into multiple ones. Since we keep only the last among consecutive revisions by the same user, a "revision" is a rather arbitrary unit of measurement, while a "revision amount" provides a better metric. Thus, when studying edit longevity, we associate with each r in R a probability mass proportional to q_e(r), giving rise to the probability measure Pr_e. Similarly, when studying text longevity, we associate with each r in R a probability mass proportional to q_t(r), giving rise to the probability measure Pr_t.

In order to develop figures of merit for our reputation, we define the following terminology (used already in the introduction in informal fashion):

· We say that the edit performed in r is short-lived if alpha_e(r) <= -0.8.
· We say that the new text added in r is short-lived if alpha_t(r) <= 0.2, indicating that at most 20% of it, on average, survives from one version to the next.
· We say that a revision r is low-reputation if log(1 + rep(r)) <= log(1 + c_maxrep)/5, indicating that the reputation, after logarithmic scaling, falls in the lowest 20% of the range.

Correspondingly, we define three random variables S_e, S_t, L: R -> {0, 1} as follows, for all r in R:

· S_e(r) = 1 if alpha_e(r) <= -0.8, and S_e(r) = 0 otherwise.
· S_t(r) = 1 if alpha_t(r) <= 0.2, and S_t(r) = 0 otherwise.
· L(r) = 1 if log(1 + rep(r)) <= log(1 + c_maxrep)/5, and L(r) = 0 otherwise.

For short-lived text, the precision is prec_t = Pr_t(S_t = 1 | L = 1), and the recall is rec_t = Pr_t(L = 1 | S_t = 1). Similarly, for short-lived edits, the precision is prec_e = Pr_e(S_e = 1 | L = 1), and the recall is rec_e = Pr_e(L = 1 | S_e = 1). These quantities can be computed as usual; for instance,

    Pr_e(S_e = 1 | L = 1) = [ sum_{r in R} S_e(r) · L(r) · q_e(r) ] / [ sum_{r in R} L(r) · q_e(r) ].

We also define:

    boost_e = Pr_e(S_e = 1 | L = 1) / Pr_e(S_e = 1) = Pr_e(S_e = 1, L = 1) / [ Pr_e(S_e = 1) · Pr_e(L = 1) ],
    boost_t = Pr_t(S_t = 1 | L = 1) / Pr_t(S_t = 1) = Pr_t(S_t = 1, L = 1) / [ Pr_t(S_t = 1) · Pr_t(L = 1) ].

Intuitively, boost_e indicates how much more likely than average it is that edits produced by low-reputation authors are short-lived. The quantity boost_t has a similar meaning. Our last indicators of quality are the coefficients of constraint

    kappa_e = I_e(S_e, L) / H_e(L),    kappa_t = I_t(S_t, L) / H_t(L),

where I_e is the mutual information of S_e and L, computed with respect to Pr_e, and H_e is the entropy of L, computed with respect to Pr_e [2]; similarly for I_t(S_t, L) and H_t(L). The quantity kappa_e is the fraction of the entropy of the edit longevity which can be explained by the reputation of the author; this is an information-theoretic measure of correlation. The quantity kappa_t has an analogous meaning.
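Two small sketches of the computations in this section: solving equation (2) for the text longevity by bisection (one choice of the "standard numerical methods" mentioned above), and evaluating the edit-weighted precision, recall and boost. The record field names (weight, short, low) are illustrative assumptions, and the snippets assume both the short-lived and low-reputation classes are non-empty.

```python
def text_longevity(txt_row, tol=1e-6):
    """Solve equation (2) for alpha_t(r_i) by bisection on [0, 1].
    txt_row[k] = txt(i, i + k) for k = 0 .. n - i, with txt_row[0] = txt(i, i) > 0."""
    target = sum(txt_row)
    geometric = lambda a: txt_row[0] * sum(a ** k for k in range(len(txt_row)))
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        lo, hi = (mid, hi) if geometric(mid) < target else (lo, mid)
    return (lo + hi) / 2.0

def edit_metrics(revisions):
    """prec_e, rec_e and boost_e under the edit-weighted measure Pr_e (Section 4).
    Each record carries weight = q_e(r), short = S_e(r) in {0,1}, low = L(r) in {0,1};
    the text-weighted versions are analogous, with q_t and S_t in place of q_e and S_e."""
    w_all = w_low = w_short = w_both = 0.0
    for r in revisions:
        w_all += r.weight
        w_low += r.weight * r.low
        w_short += r.weight * r.short
        w_both += r.weight * r.short * r.low
    prec = w_both / w_low              # Pr_e(S_e = 1 | L = 1)
    rec = w_both / w_short             # Pr_e(L = 1 | S_e = 1)
    boost = prec / (w_short / w_all)   # Pr_e(S_e = 1 | L = 1) / Pr_e(S_e = 1)
    return prec, rec, boost
```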
5. EXPERIMENTAL RESULTS

To evaluate our content-driven reputation, we considered two Wikipedias:

· The Italian Wikipedia, consisting of 154,621 articles and 714,280 filtered revisions; we used a snapshot dated December 11, 2005.
· The French Wikipedia, consisting of 536,930 articles and 4,837,243 filtered revisions; we used a snapshot dated October 14, 2006.

Of both Wikipedias, we studied only "NS MAIN" pages, which correspond to ordinary articles (other pages are used as comment pages, or have other specialized purposes). Moreover, to allow the accurate computation of edit longevity, we used only revisions that occurred before October 31, 2005 for the Italian Wikipedia, and before July 31, 2006 for the French one.

Our algorithms for computing content-driven reputation depend on the value of six parameters, as mentioned earlier. To assign a value to the coefficients c_scale, c_slack, c_punish, c_text, c_len, and c_maxrep, we implemented a search procedure, whose goal was to find values for the parameters that maximized a given objective function. We applied the search procedure to the Italian Wikipedia, reserving the French Wikipedia for validation once the coefficients were determined. We experimented with I_e(S_e, L) and prec_e · rec_e as objective functions, and they gave very similar results. Searching the parameter space to optimize the coefficient of constraint, with the Italian Wikipedia as a training set, we determined the following values:

    c_scale = 13.08, c_slack = 2.20, c_punish = 19.09, c_text = 0.60, c_len = 0.60, c_maxrep = 22026.
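The paper describes the parameter search only at this high level (a procedure maximizing an objective such as I_e(S_e, L) or prec_e · rec_e over the six parameters), so the following random search is purely an illustrative stand-in, not the authors' actual procedure; the bounds and trial count are arbitrary assumptions.

```python
import random

def random_search(objective, bounds, n_trials=200, seed=0):
    """Maximize `objective` (a function of a parameter dict) over axis-aligned bounds.
    bounds: dict name -> (lo, hi).  Returns the best parameter dict and its score."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {name: rng.uniform(lo, hi) for name, (lo, hi) in bounds.items()}
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```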
We then analyzed the Italian and French Wikipedias using the above values. The results are summarized in Table 1. The results are better for the larger French Wikipedia; in particular, the reputation's ability to predict short-lived edits is better on the French than on the Italian Wikipedia. We are not sure whether this depends on different dynamics in the two Wikipedias, or whether it is due to the greater age (and size) of the French Wikipedia; we plan to study this in further work. We see that edits performed by low-reputation authors are four times as likely as the average to be short-lived.

                                   Precision       Recall          Boost          Coeff. of constr.
                                   Edit    Text    Edit    Text    Edit    Text   Edit     Text
    Excluding anonymous authors:
      French Wikipedia             23.92   5.85    32.24   37.80   4.21    4.51   7.33     6.29
      Italian Wikipedia            14.15   3.94    19.39   38.69   4.03    5.83   3.35     7.17
    Including anonymous authors:
      French Wikipedia             48.94   19.01   82.86   90.42   2.35    2.97   25.29    23.00
      Italian Wikipedia            30.57   7.64    71.29   84.09   3.43    3.58   19.83    17.49

Table 1: Summary of the performance of content-driven reputation over the Italian and French Wikipedias. All data are expressed as percentages.

To investigate how many of the edits had a short life due to bad quality, we asked a group of 7 volunteers to rate revisions performed to the Italian Wikipedia. We selected the revisions to be ranked so that they contained representatives of all 4 combinations of high/low reputation author and high/low longevity. We asked the volunteers to rate the revisions with +1 (good), 0 (neutral), and -1 (bad); in total, 680 revisions were ranked. The results, summarized in Table 2, are striking. Of the short-lived edits performed by low-reputation users, fully 66% were judged bad. On the other hand, less than 19% of the short-lived edits performed by high-reputation users were judged bad.

    Reputation                Judged bad    Judged good
    Short-lived edits:
      Low    [0.0-0.2]        66%           19%
      Normal [0.2-1.0]        16%           68%
    Short-lived text:
      Low    [0.0-0.2]        74%           13%
      Normal [0.2-1.0]        14%           85%

Table 2: User ranking of short-lived edits and text, as a function of author reputation, for the Italian Wikipedia. In square brackets, we give the interval where the normalized value log(1 + r)/log(1 + c_maxrep) of a reputation r falls. The percentages do not add to 100% because users could also rank changes as "neutral".

We analyzed in detail the relationship between user reputation and the percentage of short-lived text and edits that users considered bad. Using these results, we computed the approximate recall factors, on the Italian Wikipedia, of content-driven reputation for bad edits, as judged by users, rather than short-lived ones:

· The recall for short-lived edits that are judged to be bad is over 49%.
· The recall for short-lived text that is judged to be bad is over 79%.

These results clearly indicate that our content-driven reputation is a very effective tool for spotting, at the moment they are introduced, bad contributions that will later be undone. There is some margin of error in this data, as our basis for evaluation is a small number of manually-rated revisions, and human judgement on the same revisions often contained discrepancies.

The fact that so few of the short-lived edits performed by high-reputation authors were judged to be of bad quality points to the fact that edits can be undone for reasons unrelated to quality. Many Wikipedia articles deal with current events; edits to those articles are undone regularly, even though they may be of good quality. Our algorithms do not treat current-events pages in any special way. Other Wikipedia edits are administrative in nature, flagging pages that need work or formatting; when these flags are removed, we classify this as text deletion. Furthermore, our algorithms do not track text across articles, so that when text is moved from one article to another, they classify it as deleted from the source article.

From Table 1, we note that the precision is low, by search standards. Our problem, however, is a prediction problem, not a retrieval problem, and thus it is intrinsically different. The group of authors with low reputation includes many authors who are good contributors, but who are new to the Wikipedia, so that they have not had time yet to build up their reputation. Figure 1 provides a breakdown of the amount of edits and text additions performed, according to the reputation of the author.

[Figure 1 (plot omitted): Percentage of text and edit contributed to the Italian and French Wikipedias, as a function of log(1 + reputation); one curve each for the Italian and French Wikipedias, for edits and for text. The data includes anonymous authors.]

5.1 Comparison with Edit-Count Reputation

We compared the performance of our content-driven reputation to another basic form of reputation: edit count. It is commonly believed that, as Wikipedia authors gain experience (through revision comments, talk pages, and reading articles on Wikipedia standards), the quality of their submissions goes up. Hence, it is reasonable to take edit count, that is, the number of edits performed, as a form of reputation. We compare the performance of edit count and of content-driven reputation in Table 3. The comparison does not include anonymous authors, as we do not have a meaningful notion of edit count for them. As we can see, according to our metrics, content-driven reputation performs slightly better than edit-count reputation on both the Italian and French Wikipedias.

                                    Precision       Recall          Boost          Coeff. of constr.
                                    Edit    Text    Edit    Text    Edit    Text   Edit     Text
    Italian Wikipedia:
      Content-driven reputation     14.15   3.94    19.39   38.69   4.03    5.83   3.35     7.17
      Edit count as reputation      11.50   3.32    19.09   39.52   3.27    4.91   2.53     6.35
    French Wikipedia:
      Content-driven reputation     23.92   5.85    32.24   37.80   4.21    4.51   7.33     6.29
      Edit count as reputation      21.62   5.63    28.30   37.92   3.81    4.34   5.61     6.08

Table 3: Summary of the performance of content-driven and edit-count reputation over the Italian and French Wikipedias. All data are expressed as percentages. Anonymous authors are not included in the comparison.

We believe that one reason edit-count based reputation performs well in our measurements is that authors, after performing edits that are often criticized and reverted, commonly either give up their identity in favor of a "fresh" one, thus zeroing their edit-count reputation (thus "punishing" themselves), or stop contributing to the Wikipedia. However, we believe that the good performance of edit count is an artifact, due to the fact that edit count is applied to an already-existing history of contributions. Were it announced that edit count is the chosen notion of reputation, authors would most likely modify their behavior in a way that both rendered edit count useless and damaged the Wikipedia. For instance, it is likely that, were edit count the measure of reputation, authors would adopt strategies (and automated robots) for performing very many unneeded edits to the Wikipedia, causing instability and damage. In other words, edit count as a reputation measure has very little prescriptive value. In contrast, we believe our content-driven reputation, by prizing long-lasting edits and content, would encourage constructive behavior on the part of the authors.
5.2 Text Age and Author Reputation as Trust Criteria

The age of text in the Wikipedia is often considered as an indicator of text trustworthiness, the idea being that text that has been part of an article for a longer time has been vetted by more contributors and, thus, is more likely to be correct [3]. We were interested in testing the hypothesis that author reputation, in addition to text age, can be a useful indicator of trustworthiness, especially for text that has just been added to a page, and thus has not yet been vetted by other contributors. Let fresh text be the text that has just been inserted in a Wikipedia article. We considered all text that is fresh in all of the Italian Wikipedia, and we measured that 3.87% of this fresh text is deleted in the next revision. In other words, Pr(deleted | fresh) = 0.0387. We then repeated the measurement for text that is both fresh and due to a low-reputation author: 6.36% of it was deleted in the next revision, or Pr(deleted | fresh and low-reputation) = 0.0636. This indicates that author reputation is a useful factor in predicting the survival probability of fresh text, if not directly its trustworthiness. Indeed, as remarked above, since text can be deleted for a number of reasons aside from bad quality, author reputation is most likely a better indicator of trustworthiness than these figures indicate.
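The two conditional probabilities above are simple word-level frequencies; a minimal sketch of the measurement, with an assumed per-word record layout (deleted_next, low_rep), is:

```python
def fresh_text_deletion_rates(words):
    """Pr(deleted | fresh) and Pr(deleted | fresh, low-reputation author), as in
    Section 5.2.  `words` iterates over freshly inserted words; deleted_next marks words
    gone in the next revision, low_rep marks words added by a low-reputation author."""
    n_fresh = n_del = n_fresh_low = n_del_low = 0
    for w in words:
        n_fresh += 1
        n_del += w.deleted_next
        if w.low_rep:
            n_fresh_low += 1
            n_del_low += w.deleted_next
    return n_del / n_fresh, n_del_low / n_fresh_low
```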
5.3 Conclusions

After comparing edit and text longevity values with user quality ratings for revisions, we believe that the largest residual source of error in our content-driven reputation lies in the fact that our text analysis does not include specific knowledge of the Wikipedia markup language and Wikipedia conventions. We plan to make the text analysis more precise in future work.

Acknowledgements. We thank Marco Faella for helping organize the manual rating of revisions to the Italian Wikipedia, and we thank the anonymous reviewers for their insightful and helpful feedback.

6. REFERENCES

[1] R. Burns and D. Long. A linear time, constant space differencing algorithm. In Performance, Computing, and Communications Conference (IPCCC), pages 429-436. IEEE International, 1997.
[2] T. Cover and J. Thomas. Elements of Information Theory. J. Wiley & Sons, 1991.
[3] T. Cross. Puppy smoothies: Improving the reliability of open, collaborative wikis. First Monday, 11(9), September 2006.
[4] W. Cunningham and B. Leuf. The Wiki Way: Quick Collaboration on the Web. Addison-Wesley, 2001.
[5] C. Dellarocas. The digitization of word-of-mouth: Promises and challenges of online reputation systems. Management Science, October 2003.
[6] J. Golbeck. Computing and Applying Trust in Web-Based Social Networks. PhD thesis, University of Maryland, 2005.
[7] R. Guha, R. Kumar, P. Raghavan, and A. Tomkins. Propagation of trust and distrust. In Proc. of the 13th Intl. Conf. on World Wide Web, pages 403-412. ACM Press, 2004.
[8] J. Hunt and M. McIlroy. An algorithm for differential file comparison. Computer Science Technical Report 41, Bell Laboratories, 1975.
[9] S. Kamvar, M. Schlosser, and H. Garcia-Molina. The EigenTrust algorithm for reputation management in P2P networks. In Proc. of the 12th Intl. Conf. on World Wide Web, pages 640-651. ACM Press, 2003.
[10] J. Kleinberg. Authoritative sources in a hyperlinked environment. J. ACM, 46(5):604-632, 1999.
[11] V. Livshits and T. Zimmermann. DynaMine: Finding common error patterns by mining software revision histories. In ESEC/FSE, pages 296-305, 2005.
[12] D. Lucking-Reiley, D. Bryan, N. Prasad, and D. Reeves. Pennies from Ebay: The determinants of price in online auctions. Working paper, Vanderbilt University, 1999.
[13] D. McGuinness, H. Zeng, P. da Silva, L. Ding, D. Narayanan, and M. Bhaowal. Investigation into trust for collaborative information repositories: A Wikipedia case study. In Proceedings of the Workshop on Models of Trust for the Web, 2006.
[14] E. Myers. An O(ND) difference algorithm and its variations. Algorithmica, 1(2):251-266, 1986.
[15] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford Digital Library Technologies Project, 1998.
[16] P. Resnick, R. Zeckhauser, E. Friedman, and K. Kuwabara. Reputation systems. Comm. ACM, 43(12):45-48, 2000.
[17] W. Tichy. The string-to-string correction problem with block moves. ACM Transactions on Computer Systems, 2(4), 1984.
[18] F. Viégas, M. Wattenberg, and K. Dave. Studying cooperation and conflict between authors with history flow visualizations. In Proc. of the SIGCHI Conf. on Human Factors in Computing Systems, pages 575-582, 2004.
[19] H. Zeng, M. Alhoussaini, L. Ding, R. Fikes, and D. McGuinness. Computing trust from revision history. In Intl. Conf. on Privacy, Security and Trust, 2006.