WWW 2008 / Refereed Track: Internet Monetization - Sponsored Search April 21-25, 2008 · Beijing, China Online Learning from Click Data for Sponsored Search Massimiliano Ciaramita, Vanessa Murdock, Vassilis Plachouras Yahoo! Research, Ocata 1, Barcelona, 08003, Catalunya, Spain {massi | vmurdock | vassilis}@yahoo-inc.com ABSTRACT Sp onsored search is one of the enabling technologies for today's Web search engines. It corresp onds to matching and showing ads related to the user query on the search engine results page. Users are likely to click on topically related ads and the advertisers pay only when a user clicks on their ad. Hence, it is imp ortant to b e able to predict if an ad is likely to b e clicked, and maximize the numb er of clicks. We investigate the sp onsored search problem from a machine learning p ersp ective with resp ect to three main sub-problems: how to use click data for training and evaluation, which learning framework is more suitable for the task, and which features are useful for existing models. We p erform a large scale evaluation based on data from a commercial Web search engine. Results show that it is p ossible to learn and evaluate directly and exclusively on click data encoding pairwise preferences following simple and conservative assumptions. Furthermore, we find that online multilayer p erceptron learning, based on a small set of features representing content similarity of different kinds, significantly outp erforms an information retrieval baseline and other learning models, providing a suitable framework for the sp onsored search task. Categories and Subject Descriptors H.3 [INFORMATION STORAGE AND RETRIEVAL]; H.3.5 [Online Information Services]: Commercial services; I.5 [PATTERN RECOGNITION]; I.5.1 [Models]: Neural Nets; I.5.2 [Design Methodology]: Classifier design and evaluation General Terms Algorithms, Design, Exp erimentation Keywords Sp onsored search, ranking, online learning, p erceptrons 1. INTRODUCTION Sp onsored search is the task of placing ads that relate to the user's query on the same page as the search results returned by the search engine. Typically sp onsored search results resemble search result snipp ets in that they have a title, and a small amount of additional text, as in Figure 1. Copyright is held by the International World Wide Web Conference Committee (IW3C2). Distribution of these papers is limited to classroom use, and personal use by others. WWW 2008, April 21­25, 2008, Beijing, China. ACM 978-1-60558-085-2/08/04. When the user clicks on the title he is taken to the landing page of the advertiser. Search engine revenue is generated in large part by the sp onsored results. Of the $16.9 billion generated by online advertising in 2006, 40% was generated by search advertising [12]. Clearly it is an advantage to the search engine to provide ads users will click. While the final ranking takes into account the amount bidded by the advertiser and, more generally, the microeconomic model adopted by the search engine to maximize revenue, relevance, or quality of the ads returned is a crucial factor. Users are more likely to click ads that are relevant to their query. In fact, there is evidence that the level of congruence b etween the ad and its context has a significant effect even in the absence of a conscious resp onse such as a click [34]. If we assume that congruency equates to topical similarity or "relevance", the task is to show the ads that are most similar to the user's query in the sp onsored results. With this in mind we would like to place ads in the sp onsored results that are a good match for the user's query. In this pap er we abstract from the economic asp ects of the problem and focus on the issue of improving the quality of the ads prop osed as relevant to the query. Information ab out the quality of match at the content level can b e taken into account later, including the micro-economic model, to compile the final list of ads to b e displayed. While the publisher goal is maximizing profit, we stress the fact that maximizing ads quality or relevance is a crucial factor in this process. Aside from the difficulties in assessing the similarity of an ad to a query that stem from the sparseness of the representation of b oth the query and the ad, the task of placing ads is complicated by the fact that users click on ads for a wide variety of reasons that are not reflected in the similarity of an ad to a query. For example, there is a strong p ositional bias to user clicks. Users are much more likely to click on items at the top of a ranked list of search results than items lower in the ranking [11, 14]. This makes using the click data to learn a ranking function over the ads and to evaluate the system more difficult. Sp ecifically, user clicks are not an indication of absolute relevance. Rather, as Joachims prop osed [13], the user click mainly indicates the clicked item is more relevant than the those items which were ranked higher but were not clicked. This observation implies that p ositive and negative examples can b e extracted from query logs in a meaningful way, avoiding the complexities and noise of click-through rate estimation. Building on Joachims' suggestion we create a training set from the query logs of a real sp onsored search system. We prop ose that this typ e of data can b e also used directly 227 WWW 2008 / Refereed Track: Internet Monetization - Sponsored Search April 21-25, 2008 · Beijing, China Figure 1. Example of sp onsored search advertisements on the search engine results page of Yahoo! for the query "Beijing 2008" . Ads are shown on the right side under the lab el "SPONSOR RESULTS" . for evaluating a learning model. A few studies have already shown that click feedback from users can b e used to improve machine learned ranking [32, 1]. However, this typ e of information has not b een used for b oth training and evaluation. Previous work [32, 1] has ultimately relied on editorial data for evaluation. We show that consistent results are obtained in this way across different ranking methods and different feature sets. We formulate the problem of ranking a set of ads given a query as a learning task and investigate three learning methods of increasing complexity based on the p erceptron algorithm: a binary linear classifier, a linear ranking model and a multilayer p erceptron, or artificial neural net. Sp ecifically, we focus on online learning methods which suit the task of learning from large amounts of data, or from a stream of incoming feedback from query logs. As hop ed, we find that accuracy increases with the complexity of the model. Retrieving ads is complex in part b ecause the text is sparse. In addition, features based on link structure that have b een shown to b enefit retrieval in a Web setting cannot b e applied to ads b ecause they lack an obvious link organization. Exploiting our setup we investigate several classes of features which have b een prop osed recently for content match, the task of ranking ads with resp ect to the context of a Web page, rather than a query. We start from the cosine similarity b etween query and ad. We decomp ose the ad and use as features the similarity of individual comp onents of the ad and the query, and also prop erties of the degree of overlapping b etween the query and the ad [25]. Next we investigate a class of language-indep endent, knowledge free, features based on the distributional similarity of pairs of words which have b een used successfully in content match [6], and can b e extracted from any text collection or query log. These features measure the similarity b etween two texts indep endently of exact matches at the string level and are meant to capture indirect semantic associations. In content match there are many words that can b e extracted from a Web page to compute such features, while in sp onsored search there are only the terms in the query. We show that across all learning methods these features produce the b est results. Overall, with a compact set of just twelve features we improve significantly over all traditional models. This pap er presents the following primary contributions. First, we show that click-data can b e used directly for evaluation purp oses which is a desirable prop erty in the context of large scale systems that otherwise have to rely exclusively on editorial data, or carry out noisy estimations of clickthrough rates. Second, we show empirically that different methods of increasing complexity can b e applied to the task and generate consistent results. This is imp ortant b ecause it supp orts the hyp othesis that the evaluation is consistent across different methods. On the learning side, it also shows that taking into account pairwise information in training is b eneficial in machine-learned ranking, even in noisy settings. Finally, we provide empirical evidence on the utility of a class of simple features for ranking ads based on lexical similarity measures, which has p ossible applications to other query-based ranking tasks such as document retrieval and search in general. The rest of the pap er is organized as follows. In Section 2 we introduce the method for unbiasing click logs. Section 3 introduces several approaches to learning from click data for sp onsored search. Section 4 describ es the features evaluated in our exp eriments, which are discussed in Sections 5 and 6. Section 7 provides an overview of related work. 2. CLICK DATA Employing user clicks to train and to evaluate a sp onsored search system is a natural choice, since the goal in sp onsored search is maximizing the numb er of clicks. However, user clicks cannot b e used in a straight-forward manner b ecause they have a strong p ositional bias [11, 14], and they only provide a relative indication of relevance [13]. There is a strong p ositional bias as highly ranked results or ads may b e clicked b ecause of their rank and not their relevance. For example, a user may click on the top ranked ad and then click on the third ad in the ranking, even if the third ad may b e more relevant to his query. The reason for this bias is that users are likely to sequentially scan the ranked list of 228 WWW 2008 / Refereed Track: Internet Monetization - Sponsored Search April 21-25, 2008 · Beijing, China Ad Ranking a1 a2 a3 a4 a5 a6 Blocks a 2 -1 a 3 +1 a2 a4 a5 a6 -1 -1 -1 +1 such a way one introduces effectively an equivalence of relevance b etween clicked ads (p ositive p oints), while according to our conservative assumption we cannot really say anything ab out the relative relevance of two clicks in different blocks, b ecause we do not have reliable means of interpreting clicks in an absolute way. As a matter of fact, in some of the ranking models we prop ose (classification and multilayer p erceptron) the block structure is effectively ignored in training. However training on local pairwise preferences can b e b eneficial, as we show in the exp erimental section going from a global binary classification model to a ranking model which exploits the block structure. The block-based encoding is compatible with b oth approaches and defines a robust and intuitive learning task. Figure 2. Illustrative example of how blocks are generated from the clicked and non-clicked ads for a query. 3. LEARNING MODELS Learning with clicks can involve arbitrarily large amounts of data, or even learning from a continuous stream of data. Online learning algorithms are the most natural choice for this typ e of task, since the data need not b e considered (or stored in memory) all at once; each pattern is used for learning in isolation. Notice how learning from blocks fits p erfectly online learning in that any feedback can b e immediately used to up date the current model, even when a query or ad is observed for the first time, without the need to accumulate evidence for reliable estimate of click-through rates. As a general online learning framework we choose the p erceptron algorithm. The p erceptron was invented by Frank Rosemblatt in 1958 [26], and was initially criticized b ecause of its inability to solve non-linear problems [22]. In fact, the p erceptron, like supp ort vector machines (SVM) and other methods, can learn non-linear models by means of kernel functions in dual algorithms, or by means of higher-order feature mappings in the primal form, or again (see b elow Section 3.3) by means of multilayer architectures. The p erceptron has received much attention in recent years for its simplicity and flexibility. Among other fields, the p erceptron is p opular in natural language processing, where it has b een successfully applied to several tasks such as syntactic parsing, tagging, information extraction and reranking [7, 29, 30, 31]. We preferred the p erceptron over other p opular methods, such as SVM, for which incremental formulations have b een prop osed [5, 17], b ecause accuratelydesigned p erceptrons (i.e., including regularization, margin functions, etc.) often p erforms as well as more complex methods at a smaller computational cost. Moreover, the simplicity of the algorithm allows easy customization, which is crucial in large scale settings. In Section 3.4, we b enchmarked one of our p erceptron models on a ranking task, yielding results comparable to SVM and Boosting. We investigate three approaches to learning to rank ads based on click data: classification, ranking, and non-linear regression. The general setting involves the following elements. A pattern x IRd is a vector of features extracted from an ad-query pair (a, q ). Each pattern xi is associated with a resp onse value yi {-1, +1}. In classification we associate a vector for a pair which has not b een clicked with -1, also referred to as class y0 , and a vector for a pair which has b een clicked with +1, also referred to as class y1 . The goal of learning is to find a set of parameters, or weights, used to assign a score F (xi ; ) to patterns such that F (xi ; ) is close to the actual value yi . In particular, we are interested in predicting the clicked ad in a block of ads. items and may click on an item b efore scanning the whole list, or may never scan the whole list at all. To investigate how to employ user clicks to train and evaluate a sp onsored search system, we have collected a set of queries and the ads that had b een shown with them on the right-hand side of the search engine results page, from the logs of the Yahoo! Web search engine. We sampled queries until we collected a sufficiently large numb er of distinct clicked ads. We sampled only queries with three or more query terms b ecause longer queries are more likely to lead to higher conversion rates [24]. In other words, users issuing longer queries are more likely to visit a Web site and p erform a transaction. We also considered only one click for a query-ad pair from one user p er day. Following Joachims [13], we make a conservative assumption that a click can only serve as an indication that an ad is more relevant than the ads ranked higher but not clicked, but not as an absolute indication of the ad relevance. In this setting, the clicks on the top ranked ad do not carry any information, b ecause the top ranked ad cannot b e ranked any higher. In other words, there is no discriminative pairwise information associated with a click at p osition one. Hence, we do not consider such clicks in the remainder of our exp eriments. For each clicked ad, we create a block which consists of the clicked ad and the non-clicked ads that ranked higher, for a total of 123,798 blocks. In each block, we assign a score of "+1" to the clicked ad and "-1" to the ads that were ranked higher but were not clicked. Figure 2 shows an example of the score assignment process. On the left-hand side of the figure, we show the ranking of six ads for a query. The ellipses around ads a1 , a3 and a5 denote that these ads were clicked by the user who submitted the query. The "gold-standard" blocks of ads, shown on the right-hand side of the figure, are generated in the following way. First we ignore the click on ad a1 since this ad was already ranked first and it was clicked. Then, we form one block of ads with a2 and a3 , assigning scores of "-1" and "+1", resp ectively. Next, we form a second block of ads consisting of a2 , a4 , a5 with scores "-1" and a6 with score "+1". In Figure 2 blocks of ads are shown with shaded areas. From the example ab ove we obtain two gold standard blocks from one user session. Alternatively, one could generate a single vector of binary values including all the p ositive and negative examples. One p otential problem is that in 229 WWW 2008 / Refereed Track: Internet Monetization - Sponsored Search April 21-25, 2008 · Beijing, China 3.1 Classification In a classification framework the goal is to learn a function which is able to accurately assign a pattern to either the clicked or non-clicked class. Patterns in the data are used indep endently of one another in training and the classifier simply finds a weight vector which assigns each pattern to the correct class1 . After training, the classifier can b e used to rank ads given a query; e.g., to identify the most likely clickable pattern in a block. The basic classifier is a binary p erceptron. We extend the basic model by averaging and adding an uneven margin function. Averaging is a method for regularizing the classifier by using ­ for prediction after training ­ the average weight vector of all p erceptron models p osited during training [10]. The uneven margin function is a method for learning a classifier with large margins for cases in which the distribution of classes is unbalanced [20]. Since non-clicked ads are more numerous than clicked ads we face an unbalanced learning task. The uneven margin function pushes learning towards achieving a larger margin on the p ositive class. The binary p erceptron uses the sign function as a discriminant: F (x; ) = Sgn( x, ) (1) Figure 3. Examples of decision b oundaries that can b e learned by the linear, on the left, and non-linear models, on the right, in a two-dimensional case. X denotes p ositive (clicked) patterns while circles denote negative (non-clicked) patterns. The linear classifier depicted on the left correctly classifies the p ositive examples, but misclassifies a negative one. The non-linear classifier can find complex decision b oundaries to solve such non-linearly separable cases. xj ) is computed. Given a margin function g and a p ositive learning margin , if Srnk (xi - xj ) g (r (yi ), r (yj )) , an up date is made as follows: t+1 = t + (xi - xj )g (r (yi ), r (yj )) (5) We learn from the training data. The model has two adjustable parameters. The first is the numb er of instances T to use in training, or the numb er of passes (ep ochs) over the training data. The second concerns the uneven margin function, for which we define a constant 1 used in training to define a margin on the p ositive class. While training, an error is made on a p ositive instance x if F (x; ) 1 ; the parameter on the negative class 0 = 0 and is effectively ignored. The learning rule is: t+1 = t + yi xi (2) In particular, b ecause the discriminant function is an inner product, Srnk (xi - xj ) = Srnk (xi ) - Srnk (xj ). By default we use g (i, j ) = ( 1 - 1 ) as a margin function as suggested i j in [30]. Although there are only two p ossible ranks in our setting, the hop e is that training on pairs provides more information than training on patterns in isolation. For regularization purp oses, averaging is applied also to the ranking p erceptron. 3.3 Multilayer Regression One p ossible drawback of the previous methods is that they are limited to learning linear solutions. To improve the expressive p ower of our ranking functions, within the online p erceptron approach, we exp erimented also with multilayer models. The top ology of multilayer p erceptrons includes at least one non-linear activation layer b etween the input and the output layers. Multi-layer networks with sigmoid non-linear layers can generate arbitrarily complex contiguous decision b oundaries [8], as shown in Figure 3, and have b een used successfully in several tasks, including learning to rank [3]. The multilayer p erceptron is a fully connected three-layer network with the following structure: 1. Input layer: d units x1 , x2 , .., xd + a constant input x0 = 1 2. Hidden layer: nH units w1 , w2 , .., wnH + a constant weight w0 = 1 3. Output layer: one unit z 4. Weight vector: 2 IRnH + a bias unit 2 0 5. Weight matrix: 1 IRd×nH + a bias vector 1 0 IRnH The score Smlp (x) of a pattern x is computed with a feedforward pass: Smlp (x) = z = nH X j =1 The ranking function defined on the binary classifier is simply the inner product b etween the pattern and the weight vector: Sapm = x, (3) In evaluation, we use this score to rank ads in each block. 3.2 Ranking Another way of modeling click feedback is directly as a ranking problem. For this purp ose we implement the ranking p erceptron prop osed by Shen and Joshi [30], which reduces the ranking problem to a binary classification problem. The general ob jective is to exploit the pairwise preferences induced from the data by training on pairs of patterns, rather than indep endently on each pattern. Let Rb b e a set of pairs of patterns for a block b, such that (xi , xj ) Rb r (yi ) < r (yj ), where r (yi ) is the rank of xi in b; i.e., in our case, either yi = 1 and r (yi ) = 1, or yi = -1 and r (yi ) = 2. Given a weight vector , the score for a pattern x is again the inner product b etween the pattern and the weight vector: Srnk = x, (4) However, the error function dep ends on pairwise scores. In training, for each pair (xi , xj ) Rb , the score Srnk (xi - Under the assumption that the training data is separable, i.e., there exists an hyp erplane which correctly classifies all data p oints. 1 2 wj + 2 = 2 , w j 0 (6) 230 WWW 2008 / Refereed Track: Internet Monetization - Sponsored Search where wj = f (netj ), and netj = d X i=1 April 21-25, 2008 · Beijing, China has a value of one if all of the query terms are present in the ad: 1j xi + 1 = 1 , x i 0 j (7) if (t q )t a, F1 = 1, and 0 otherwise. (12) The activation function f (.) of the hidden unit is the sigmoid: 1 f (net) = . (8) 1 + exp-a net Sup ervised training b egins with an untrained network with randomly initialized parameters. Training is carried out with backpropagation [27]. An input pattern xi is selected, its score computed with a feedforward pass and compared to the true value yi . Then the parameters are adjusted to bring the score closer to the actual value of the input pattern. The error E on a pattern xi is the squared difference b etween the guessed score Smlp (xi ) and the actual value yi of xi , or for 1 brevity (yi - si ); thus E = 2 (yi - si )2 . After each iteration t, is up dated comp onent-wise to t+1 by taking a step in weight space which lowers the error function: t+1 = t + t E = t - t (9) The second feature has a value of one if some of the query terms are present in the ad: if t q such that t a, F2 = 1, and 0 otherwise. (13) The third feature has a value of one if none of the query terms are present in the ad: if ¬t q such that t a, F3 = 1, and 0 otherwise. (14) The fourth feature is the p ercentage of the query terms that have an exact match in the ad materials. Prior to computing the features, b oth the query and the ad were normalized for case. For the purp ose of word overlap, we chose to stem and stop less aggressively than we would do with functions that are smoothed. For this reason we used a Krovetz stemmer [18], and only single characters were removed. 4.2 Cosine similarity We compute the cosine similarity sim(q , a) b etween the query q and the ad a as follows: P tq a wq t wat qP sim(q , a) = qP (15) 2 2 tq wq t ta wat where the weight wt of a term in q or a corresp onds to the tf - idf weight: wt = tf · log2 N +1 nt + 0.5 (16) where is the learning rate, which affects the magnitude, or sp eed, of the changes in weight space. The weight up date for the hidden-to-output weights is: 2 = - wi i (10) where = (yi -zi ). The learning rule for the input-to-hidden weights is: 1j = - xj f (netj )1j . i i (11) where f is the derivative of the non-linear activation function. 3.4 Benchmark on a public dataset A full comparison of the methods presented here with other methods on information retrieval tasks is b eyond the scop e of the pap er. However, to get an estimate for the accuracy of the methods we implemented, we evaluated the ranking p erceptron (see Section 3), on the Letor dataset [21]. On all evaluation metrics the ranking p erceptron achieves scores comparable to SVM on the OHSUMED and TD2003 datasets, and comparable to RankBoost on TD2004. Thus in line with the b est method in each dataset. The multilayer p erceptron outp erforms the ranking p erceptron on exploratory runs, but we did not carry out extensive comparisons in this context. 4. FEATURES A range of features, from simple word overlap and textual similarity features to statistical association b etween terms from the query and the ads, are used for learning a ranking model. In particular, we are interested in finding features which helps in the face of sparse data, as ads are characterized by small amounts of text. In Equation (16), tf is the frequency of a term in q or in a. When considering queries q , tf is exp ected to b e uniformly distributed with one b eing the most likely value, b ecause terms are not likely to b e rep eated in queries. In addition, N corresp onds to the total numb er of available ads and nt corresp onds to the numb er of ads in which term t occurs. The tf - idf weight wat of term t in a is computed in the same way. We have also computed the cosine similarity b etween q and each of the fields of the ads, that is, the ad title at , the ad description ad and its bidded terms ab . In all cases, we have applied Porter's stemming algorithm and we have removed stop words. Cosine similarity has b een used effectively for ranking ads to place on Web pages in the setting of contextual advertising [25, 6]. A difference with our setting is that in the case of contextual advertising, the cosine similarity is computed b etween the Web page and ad. While there are more complex similarity functions that have b een develop ed and applied for the case of computing the similarity b etween short snipp ets of text [28], we use cosine similarity b ecause it is parameter free and inexp ensive to compute. 4.3 Correlations Queries and ads are b oth short snipp ets of text, which tend not to have a high vocabulary overlap. To address this issue, and exploit additional information from non-matching terms, we consider two features based on measuring the statistical association of terms from an external corpus. 4.1 Word Overlap We compute four features that assess the degree of overlap b etween the query and the ad materials. The first feature 231 WWW 2008 / Refereed Track: Internet Monetization - Sponsored Search Feature Name NoKey SomeKey AllKey PercentKey Abbrev. April 21-25, 2008 · Beijing, China O Ad Title Description Bidterm AvePMI MaxPMI CSQ B F P C Description Word Overlap Features 1 if no query term is present in the ad materials; 0 otherwise 1 if at least one query term is present in the ad materials; 0 otherwise 1 if every query term is present in the ad materials; 0 otherwise The numb er of query terms present in the ad materials divided by the numb er of query terms Cosine Similarity Features The cosine similarity b etween the query and the ad materials (baseline) The cosine similarity b etween the query and the ad title The cosine similarity b etween the query and the ad description The cosine similarity b etween the query and the bidded terms Correlation Features The average p ointwise mutual information b etween terms in the query and terms in the ad The maximum p ointwise mutual information b etween terms in the query and terms in the ad Numb er of query-ad term pairs that have a 2 statistic in the top 5% of computed 2 values. Table 1. Summary of features. The column "Abbrev." provides an abbreviated name for one or more features, as they will b e used in the exp eriments. 4.3.1 Pointwise Mutual Information One measure of association b etween terms is p ointwise mutual information (PMI). We compute PMI b etween terms of a query q and the bidded terms of an ad a. PMI is based on co-occurrence information, which we obtain from a set of queries submitted to the Yahoo! search engine: P M I (t1 , t2 ) = log2 P (t 1 , t 2 ) P (t 1 )P (t 2 ) (17) t2 ¬t2 t1 o11 o21 ¬t1 o12 o22 Table 2. Definition of oij for the calculation of the 2 statistic in Equation 18. Part 1 2 3 4 5 Development size 1358 1517 1400 1408 1410 Test size 1445 1369 1488 1514 1329 where t1 is a term from q , and t2 is a bidded term from the ad a. P (t) is the probability that term t app ears in the query log, and P (t1 , t2 ) is the probability that terms t1 and t2 occur in the same query. We form the pairs of t1 and t2 by extracting the query terms and the bidded terms of the ad. We only consider pairs of terms consisting of distinct terms with at least one letter. For each pair (q , a) we use two features: the average PMI and the maximum PMI, denoted by AvePMI and MaxPMI, resp ectively. Table 3. The numb er of blocks in each of the development and test partitions of the data. as: z= xi - µi . i (19) 4.3.2 2 Statistic Another measure of association b etween terms is the 2 statistic, which is computed with resp ect to the occurrence in a query log of terms from a query, and the bidded terms of an ad: 2 = |L|(o11 o22 - o12 o21 )2 (o11 + o12 )(o11 + o21 )(o12 + o22 )(o21 + o22 ) (18) where |L| is the numb er of queries in the query log, and oij are defined in Table 2. For example o11 stands for the numb er of queries in the log, which contain b oth terms t1 and t2 . Similarly, o12 stands for the numb er of queries in the log, in which term t2 occurs but term t1 does not. We compute the 2 statistic for the same pairs of terms on which we compute the PMI features. Then, for each query-ad pair, we count the numb er of term pairs that have a 2 higher than 95% of all the computed 2 values. An overview of the features we use is shown in Table 1. All feature values were normalized by column across the entire dataset with a z - scor e, in order to have zero mean and unit standard deviation, thus each feature xi was standardized In addition we augmented each data vector with a bias feature which has a value of one for every example, and serves as a prior on the resp onse variable. Before continuing, it is imp ortant to observe that all the features we describ ed in this section can b e computed efficiently. For example, the word overlap and cosine similarity features can b e computed online, when a user query is received. The correlation features can also b e computed online efficiently by using a look-up table with word co-occurrence information for pairs of words. 5. EXPERIMENTAL SETUP We split the dataset into one training set, 5 development sets and 5 test sets, so that all the blocks for a given query are in the same set. The exact numb er of blocks for each of the development and test sets is given in Table 3. The training set consists of 109,560 blocks. A ranking algorithm produces a score for each query-ad pair. The ads are ranked according to this score. Because of the way the data is constructed to account for the relative p osition of clicks, each block has only one click associated 232 WWW 2008 / Refereed Track: Internet Monetization - Sponsored Search Feature set B BO BF BFO BFOP BFOC BFOCP Classification Prec at 1 MRR 0.322 0.582 ±0.306 0.319 0.578 ±0.306 0.341 0.593 ±0.309 0.357 0.605 ±0.311 0.357 0.604 ±0.311 0.351 0.601 ±0.310 0.360 0.606 ±0.311 Prec at 1 0.333 0.352 0.347 0.357 0.359 0.364 0.363 Ranking MRR 0.590 ±0.307 0.602 ±0.310 0.597 ±0.310 0.605 ±0.311 0.606 ±0.311 0.610 ±0.311 0.609 ±0.311 April 21-25, 2008 · Beijing, China Regression Prec at 1 MRR 0.328 0.585 ±0.307 0.343 0.596 ±0.309 0.374 0.615 ±0.314 0.371 0.614 ±0.313 0.374 0.617 ±0.313 0.381 0.619 ±0.315 0.388 0.624 ±0.315 Table 4. The results for classification, ranking and regression, computed over all trials. The b est result is indicated in b old. Results that are statistically significant with resp ect to the baseline are indicated with a star. Results indicated with a dagger are statistically significant with resp ect to the features B + F + O. The results for precision at one were not tested for statistical significance. Classification Prec at 1 MRR 0.322 ±0.008 0.582 ±0.003 0.339 ±0.020 0.591 ±0.012 0.340 ±0.016 0.592 ±0.007 0.356 ±0.007 0.604 ±0.004 0.359 ±0.008 0.606 ±0.005 0.350 ±0.011 0.600 ±0.009 0.357 ±0.014 0.605 ±0.008 Ranking Prec at 1 MRR 0.333 ±0.014 0.590 ±0.006 0.352 ±0.010 0.602 ±0.005 0.345 ±0.007 0.596 ±0.004 0.359 ±0.006 0.605 ±0.003 0.361 ±0.010 0.607 ±0.007 0.365 ±0.007 0.611 ±0.003 0.364 ±0.006 0.609 ±0.003 Regression Prec at 1 MRR 0.331 ±0.020 0.586 ±0.012 0.351 ±0.016 0.600 ±0.011 0.368 ±0.013 0.611 ±0.007 0.375 ±0.016 0.616 ±0.008 0.372 ±0.015 0.614 ±0.008 0.381 ±0.010 0.619 ±0.005 0.387 ± 0.009 0.622 ±0.004 Feature set B BO BF BFO BFOP BFOC BFOCP Table 5. The average of five trials, for classification, ranking and regression. The standard deviation refers to the 5 trials, not the standard deviation within a given trial. The b est result is shown in b old. with it. For this reason we evaluate the precision at rank one, which tells us how many clicked ads were placed in the first p osition by the ranker, and the mean reciprocal rank, which tells us the average rank of the clicked ad in the output of the ranker. The mean reciprocal rank is computed as: 1 M RR = n n X i=1 e.g., the sigmoid parameter a = 1.716 [8]. Following [8] we initialized the network weights for the hidden-to-output units uniformly at random in the interval - 1 < 2 < i (nH ) 1 , (nH ) the input-to-hidden weights were initialized ran( d) 2 < ij < 1 . On the de( d) 1 r anki domly in the interval - 1 (20) where r anki is the rank of the clicked ad according to the ranking function score and n is the total numb er of blocks. The MRR score gives an indication of how far on average the ranker's guess is in the ranked list, thus providing a smoother measure than precision at one. velopment data we found that hidden layers with 50 units, and = 0.01, produced fast training and stable results, we kept these values fixed on all exp eriments involving the multilayer model. The numb er of iterations was set on the development set, running a maximum of 50 iterations2 . 6. RESULTS The baseline model has only one feature, the cosine similarity b etween the ad and the query with tf - idf weights. Table 4 shows the results for classification, ranking, and multilayer regression for each of the five test sets concatenated. That is, for the five test folds evaluated as one dataset, in order to compute the significance of the mean reciprocal rank results. For mean reciprocal rank we used a paired t-test. Results indicated with a star are significant at the p < 0.05 level with resp ect to the baseline. Most of the significant results are significant at the p < 0.01 level with resp ect to the baseline. The precision at one results were not tested for statistical significance. The significance for this metric is not computed b ecause it is not appropriate for binary data. We see that multilayer regression outp erforms b oth classification and ranking, and further that the correlation features are a significant improvement over the other models. For one third of the examples in the evaluation, the predictor correctly identifies that the first result was clicked, and an MRR of 0.60 indicates that on average the clicked result was b etween rank one and rank two. 2 One full iteration of the MLP model takes ab out 10 seconds on a desktop machine. 5.1 Model selection All adjustable parameters of the learning models were fixed on the development datasets. The b est values were selected by monitoring the average accuracy over the 5 development folds, the optimal values on development were used on the evaluation set. We trained all models with a stochastic protocol, choosing a training instance at random without replacement: a block for the ranking case, a single pattern for the classification and multilayer models. In the classification case we set the parameters T and . We tried three values for 1 , 1, 10 and 100, and found 100 to give the b est results. As for the numb er of iterations, we found that all models (not only in classification) tended to converge quickly, rarely requiring more than 20 iterations to find the b est results. In the ranking model, in addition to the numb er of iterations T we optimize the p ositive learning margin . We obtained the b est results with the value = 1, which we used in all exp eriments with ranking p erceptron. The multilayer model has a numb er of adjustable parameters, some of the parameters were kept with default values; 233 WWW 2008 / Refereed Track: Internet Monetization - Sponsored Search Although the standard deviation is high, around 0.3 for each model, this is to b e exp ected b ecause MRR can take values of 1, 0.5, 0.34, 0.25, 0.20, 0.18, etc. An average MRR of 0.6 indicates that most clicked ads were at p ositions 1, 2 and 3. If half of the results had an MRR of 1, and the other half had an MRR of 0.34, the mean would b e 0.67, resulting in a standard deviation of 0.34. Thus the high standard deviation in this case is an artifact of the metric. We also compute the averages and standard deviation across the five test sets, shown in Table 5. As indicated by the standard deviation for the trials, the method is robust to changes in the data set, even for precision at 1 which is in general a much less stable evaluation metric. As already shown for content match [25, 19, 6] weighting the similarity of each comp onent separately and adding features ab out the degree of overlapping b etween query and ad improve significantly over the baseline. The b est result for each model are achieved adding the term correlation features. April 21-25, 2008 · Beijing, China 6.1 Discussion Sp onsored search click data is noisy, p ossibly more than search clicks. People, and fraudulent software, might click on ads for reasons that have nothing to do with topical similarity or relevance. While it is not obvious that we can learn to distinguish relevant from non-relevant ads based on a user click, we find that there is enough signal in the clicks that, with a simple method for unbiasing the rank of the click [13], it is p ossible to learn and carry out meaningful evaluation without the need for editorial judgments produced manually, or complex estimation of click-through rates. Arguably, evaluating a classifier on the task of identifying the ad which will b e clicked is more directly related to the task of successfully ranking ads then guessing indirectly the relevance assigned by humans. However, since the two are strongly related it makes sense to investigate their interaction, in particular in learning. An interesting question is if, in the presence of b oth typ es of information, it might b e b eneficial to solve b oth tasks at the same time. This typ e of approach in fact fits well the artificial neural net learning methods [8] and would provide an interesting extension of the multilayer model. The non-linear multilayer p erceptron outp erforms b oth linear models. Interestingly, b oth linear models when using also the correlation features seem to p erform b etter when only use one (PMI or chi-squared) rather than b oth, (see Table 5). This might dep end on the fact that the features are strongly correlated and the linear classifier does not p osses enough information to prefer one over the other in case of disagreements. Thus it finds a b etter solution by trusting always one over the other. The non-linear model instead has enough expressive p ower to capture subtler interactions b etween features and achieves the b est results using b oth features. Another interesting asp ect is that, although there are only two p ossible rankings, and thus the problem b oils down to a binary classification task, the linear ranking p erceptron outp erforms the simpler classifier. The difference seems to lie in the way training is p erformed, by considering pairwise of patterns. Previous work on learning to rank [3] has prop osed methods for training on pairs of patterns in multilayer architectures. This is a natural direction for further investigation of this typ e of model. In terms of the features, even the simple word overlap features produced statistically significant results over the base- line model. Since we are effectively re-ranking ad candidates retrieved by a retrieval system which we treat as a black b ox, candidates are biased by the initial ad placement algorithm, and it is p ossible that the initial retrieval system preferred ads with a high degree of lexical overlap with the query, and the word overlap features provided a filter for those ads. The correlation features, which capture related terms rather than matching terms, added a significant amount of discriminative information. Such features are particularly promising b ecause they are effectively language-indep endent and knowledge free. Similar statistics can b e extracted from many resources simple to compile, or even generated by a search engine. Overall these findings suggest b oth that relevant ads contain words related to the query, and that related terms can b e captured efficiently with correlation measures such as p ointwise mutual information and the chi-squared statistic. There are several opp ortunities for further investigation of this typ e of features, for example by machine translation modeling [23]. One limitation of the current way of modeling click data is that "relevance" judgments induced by the logs are strictly binary. We have seen that using pairwise information is useful in training and it may b e desirable to generate more complex multi-valued feedback. Joachims et al. [14] prop osed manipulating the presentation of candidates to users by swapping the order of contiguous candidates to obtain likelihood ratios. An interesting question is how such information might b e automatically extracted from query logs. 7. RELATED WORK Sp onsored search can b e thought of as a document retrieval problem, where the ads are the "documents" to b e retrieved given a query. As a retrieval problem, sp onsored search is difficult b ecause ad materials contain very few terms. Because the language of the ads is so sparse, the vocabulary mismatch problem is even more critical. Jones et al. [15] approach the problem of vocabulary mismatch by generating multiple rewrites of queries to incorp orate related terms. In their system, related terms are derived from user sessions in the query logs, where query rewrites have b een identified. The set of p ossible rewrites is constrained to contain only terms that are found in the database of advertising keywords. They use a machine-learned ranking to determine the most relevant rewrite to match against the ads. In a follow on to this work, Zhang et al. [35] use active learning to select the examples to use in training machine-learned ranking. Both systems were evaluated on manual editorial judgments. By contrast our system uses click data b oth for training and evaluating the system. Furthermore, our models learn a ranking over the ads given a query directly, rather than learning a ranking over query rewrites. Advertisements are represented in part by their keywords. In one model of Online advertising, ads are matched to queries based on the keywords, and advertisers bid for the right to use the keywords to represent their product. So a related task is keyword suggestion, which can b e applied to sp onsored search or to its sister technology, contextual advertising, which places an ad in a Web page based on the similarity b etween the ad and the Web page content. Carrasco et al. [4] approach the problem of keyword suggestion by clustering bi-partite advertiser-keyword graphs. Yih et al. [33] extract keywords from Web pages using features such as the term frequency, the inverse document frequency, 234 WWW 2008 / Refereed Track: Internet Monetization - Sponsored Search and the presence of terms in the query logs. They stop short of using the keywords in an application. Although we do not consider economic factors in our work, implicit in the task of ranking ads to improve the relevance at the top of the list, is the desire to increase user clicks. Feng et al. [9] investigate the effects of four different ranking mechanisms for advertisements in sp onsored search on revenue. Two of the mechanisms dep end wholly on the bid price of the terms. Of the remaining two, one prop oses ranking ads by their exp ected clickthrough rate, and the other by a function of the bid price and the exp ected clickthrough rate. The focus of this work is on the economic impact of the advertising placement mechanism, and uses a simulated environment which treats as a variable the degree of correlation b etween the advertiser's willingness to pay for advertising slots, and the amount of traffic they attract. They find that ranking ads by a function of the bid price and the clickthrough rate outp erforms other ranking mechanisms. They further explore using the clickthrough rate to dynamically up date the ranking mechanism. They investigate two weighting schemes for user clicks. In the unweighted scheme, an ad's clickscore is increased by 1 every time it is clicked. In the weighted scheme the ad's clickscore is increased by i where i is the rank of the clicked ad, which has the effect of damping the scores of clicks in the top p ositions, and b oosting the scores of ads in lower p ositions, which is one way of addressing the p ositional bias in user clicks. There are several key differences b etween this work and ours. First, the evaluation metric in Feng et al. is the exp ected revenue, whereas we evaluate the relevance of the ads. The work by Feng et al. is done in a simulated environment, whereas ours is p erformed on data from the search engine. They use the clickthrough data as a measure of absolute relevance, rather than as a measure of relevance relative to other ads that were not clicked. Finally, and most imp ortantly, they are assessing the intrinsic relevance of an ad, indep endent of the context of the user's query. Contextual advertising is a sister technology to sp onsored search, and many of the techniques used to place ads in Web pages may b e used to place ads in resp onse to a user's query. As with sp onsored search, contextual advertising is usually a pay-p er-click model, and the ad representations are similar in b oth sp onsored search and contextual advertising. The primary difference is that rather than matching an ad to a query, the system matches the ad to a Web page. Contextual advertising also suffers from the vocabulary mismatch problem. To comp ensate for this, Rib eiro-Neto et al. [25] augment the representation of the target page using additional Web pages. As a follow-on to this work, Lacerda et al. [19] select ranking functions using genetic programming, maximizing the average precision on the training data. Broder et al. [2] use a semantic taxonomy to find topically similar terms, to facilitate the match b etween an ad and a Web page. The current pap er is a follow-on to work in contextual advertising, presented in [6] and [23]. Key differences in the current work are the use of click data in place of human edited relevance judgments, b oth for learning a ranking function and for evaluation, the application to sp onsored search rather than content match, and the use of several different typ es of classifiers. Joachims [13] prop oses that user clicks can b e used to learn ranking functions for search engine results by consid- April 21-25, 2008 · Beijing, China ering that a user click is an indicator of relative relevance. That is, a click at rank j indicates that the document at rank j is more relevant than unclicked documents that preceded it in the ranked list. We adopt this ranking mechanism directly in our work, as describ ed in Sections 2 and 5, however in addition to using it for training, we also use this ranking mechanism for evaluation. Also, Joachims [13] is the first in a series of work investigating using implicit user feedback, such as user clicks, for learning ranking functions. An overview of using implicit user feedback as surrogates for editorial relevance judgments can b e found in [16]. 8. CONCLUSION In this pap er we investigated an approach to learning and evaluating sp onsored search ranking systems based exclusively on click-data and a simple conservative unbiasing method, which fits together well with online learning protocols. In particular, we focused on modeling the textual content which is a fundamental asp ect of the ad selection problem. We found empirically that our approach produces consistent results across different learning models, of varying complexity, and across different feature representations. We found that learning on pairs of patterns is b eneficial and that multilayer p erceptrons provide a comp etitive platform for ranking from noisy data and compact feature representations. We also showed how simple and efficient semantic correlation features provide a valuable source of discriminative information in a complex task such as sp onsored search, characterized by sparse textual descriptions. Interesting avenues for further research might involve other ranking tasks where ob jects have small textual descriptions, such as image and multimedia retrieval, more sophisticated multi-layer ranking functions trained on pairwise preferences, and other variants of the simple unbiasing method used here. 9. REFERENCES [1] E. Agichtein, E. Brill, and S. Dumais. Improving web search ranking by incorp orating user b ehavior information. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, 2006. [2] A. Broder, M. Fontoura, V. Josifovski, and L. Riedel. A semantic approach to contextual advertising. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pages 559­566. ACM Press, 2007. [3] C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender. Learning to rank using gradient descent. In Proceedings of the 22nd International Conference on Machine Learning (ICML), pages 89­96, 2005. [4] J. Carrasco, D. Fain, K. Lang, and L. Zhukov. Clustering of bipartite advertiser-keyword graph. In Proceedings of the Workshop on Clustering Large Datasets, IEEE Conference on Data Mining. IEEE Computer Society Press, 2003. [5] G. Cauwenb erghs and T. Poggio. Incremental and decremental supp ort vector machine learning. In 235 WWW 2008 / Refereed Track: Internet Monetization - Sponsored Search Advances in Neural Information Processing Systems, pages 409­415, 2000. M. Ciaramita, V. Murdock, and V. Plachouras. Semantic associations for contextual advertising. International Journal of Electronic Commerce Research - Special Issue on Online Advertising and Sponsored Search, 9(1), pages 1­15, 2008. M. Collins and B. Roark. Incremental parsing with the p erceptron algorithm. In Proceedings of ACL 2004, 2004. R. Duda, P. Hart, and D. Stork. Pattern Classification (2nd ed.). Wiley-Interscience, 2000. J. Feng, H. Bhargava, and D. Pennock. Implementing sp onsored search in web search engines: Computational evaluation of alternative mechanisms. INFORMS Journal on Computing, 19(1):137­148, 2007. Y. Freund and R. Schapire. Large margin classification using the p erceptron algorithm. Machine Learning, 37(3):277­296, 1999. L. Granka, T. Joachims, and G. Gay. Eye-tracking analysis of user b ehavior in WWW search. In ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pages 478­479, 2004. IAB. Internet advertising revenue rep ort: 2006 full-year results, 2007. http://www.iab.net/resources/. T. Joachims. Optimizing search engines using clickthrough data. In Proceedings of the ACM Conference on Know ledge Discovery and Data Mining (KDD). ACM, 2002. T. Joachims, L. Granka, B. Pang, H. Hembrooke, and G. Gay. Accurately interpreting clickthrough data as implicit feedback. In ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pages 154­161, 2005. R. Jones, B. Rey, O. Madani, and W. Greiner. Generating query substitutions. In Proceedings of the 15th International World Wide Web Conference (WWW), 2006. D. Kelly. Implicit feedback: Using b ehavior to infer relevance. In New Directions in Cognitive Information Retrieval, pages 169 ­ 186. Springer Publishing, 2005. J. Kivinen, A. Smola, and R. Williamson. Online learning with kernels. In Advances in Neural Information Processing Systems, pages 785­792, 2001. R. Krovetz. Viewing morphology as an inference process. In Proceedings of the 16th International ACM SIGIR Conference on Research and Development in Information Retrieval, 1993. A. Lacerda, M. Cristo, M. Goncalves, W. Fan, N. Ziviani, and B. Rib eiro-Neto. Learning to advertise. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pages 549­556. ACM Press, 2006. Y. Li, H. Zaragoza, R. Herbrich, J. Shawe-Taylor, and J. Kandola. The p erceptron algorithm with uneven margins. In Proceedings of the Nineteenth International Conference on Machine Learning (ICML), pages 379­386, San Francisco, CA, USA, 2002. Morgan Kaufmann Publishers Inc. April 21-25, 2008 · Beijing, China [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] [17] [18] [19] [20] [21] T. Liu, J. Xu, T. Qin, W. Xiong, and H. Li. Letor: Benchmark dataset for research on learning to rank for information retrieval. In Proceedings of the SIGIR 2007 Workshop on Learning to Rank for Information Retrieval, 2007. [22] M. Minsky and S. Pap ert. Perceptrons. MIT Press, Cambridge, MA, 1969. [23] V. Murdock, M. Ciaramita, and V. Plachouras. A noisy channel approach to contextual advertising. In Proceedings of the 1st International Workshop on Data Mining and Audience Intel ligence for Advertising (ADKDD'07), pages 21­27, 2007. [24] OneUpWeb. How keyword length affects conversion rates, 2005. http://www.oneupweb.com/landing/ keywordstudy landing.htm. [25] B. Rib eiro-Neto, M. Cristo, P. Golgher, and E. D. Moura. Imp edance coupling in content-targeted advertising. In Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pages 496­503. ACM Press, 2005. [26] F. Rosemblatt. The p erceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386­408, 1958. [27] D. Rumelhart, G. Hinton, and R. Williams. Learning internal representation by backpropagating errors. Nature, 323(99):533­536, 1986. [28] M. Sahami and T. D. Heilman. A web-based kernel function for measuring the similarity of short text snipp ets. In Proceedings of the 15th international conference on World Wide Web, pages 377­386, New York, NY, USA, 2006. ACM. [29] F. Sha and F. Pereira. Shallow parsing with conditional random fields. In Proceedings of Human Language Technology and North-American Chapter of the Association for Computational Linguistics (HLT-NAACL), 2003. [30] L. Shen and A. Joshi. Ranking and reranking with p erceptron. Machine Learning. Special Issue on Learning in Speech and Language Technologies, 60(1-3):73­96, 2005. [31] M. Surdeanu and M. Ciaramita. Robust information extraction with p erceptrons. In Proceedings of NIST Automatic Content Extraction Workshop (ACE), 2007. [32] G.-R. Xue, H.-J. Zeng, Z. Chen, Y. Yu, W.-Y. Ma, W. Xi, and W. Fan. Optimizing web search using web click-through data. In Proceedings of the thirteenth ACM international conference on Information and know ledge management (CIKM), pages 118­126, New York, NY, USA, 2004. ACM. [33] W. Yih, J. Goodman, and V. Carvalho. Finding advertising keywords on web pages. In Proceedings of the 15th international conference on World Wide Web, pages 213­222, 2006. [34] C. Y. Yoo. Preattentive Processing of Web Advertising. PhD thesis, University of Texas at Austin, 2006. [35] W. V. Zhang, X. He, B. Rey, and R. Jones. Query rewriting using active learning for sp onsored search. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, 2007. 236