Learning to Advertise An´sio Lacerda1 i Weiguo Fan2 1 Marco Cristo1 Nivio Ziviani1 2 Marcos Andre Goncalves1 ¸ ´ Ber thier Ribeiro-Neto1 3 3 Federal Univ. of Minas Gerais Dept. of Computer Science Belo Horizonte, Brazil {anisio, marco, mgoncalv, nivio, berthier}@dcc.ufmg.br Virginia Tech Dept. of Computer Science Blacksburg VA, USA wfan@vt.edu Google Engineering Belo Horizonte Belo Horizonte, Brazil berthier@google.com ABSTRACT Content-targeted advertising, the task of automatically associating ads to a Web page, constitutes a key Web monetization strategy nowadays. Further, it introduces new challenging technical problems and raises interesting questions. For instance, how to design ranking functions able to satisfy conflicting goals such as selecting advertisements (ads) that are relevant to the users and suitable and profitable to the publishers and advertisers? In this paper we propose a new framework for associating ads with web pages based on Genetic Programming (GP). Our GP method aims at learning functions that select the most appropriate ads, given the contents of a Web page. These ranking functions are designed to optimize overall precision and minimize the number of misplacements. By using a real ad collection and web pages from a newspaper, we obtained a gain over a stateof-the-art baseline method of 61.7% in average precision. Further, by evolving individuals to provide good ranking estimations, GP was able to discover ranking functions that are very effective in placing ads in web pages while avoiding irrelevant ones. 1. INTRODUCTION The Internet has become one of the most important media for advertising nowadays. It represents the possibility of global exposure to large audiences at very low cost, which attracts great sums in investments in advertising. This situation was different just few years ago, when the failure of many Web companies led to a dropping in supply of cheap venture capital and considerable reduction in on-line advertising investments [29, 30]. According to the Interactive Advertising Bureau (IAB) [18], such reduction caused consecutive declines in quarterly revenues of companies in the US market, beginning with the first quarter of 2001. However, this loss trend has been reversed by the end of 2002. This recovery has coincided with the increasing adoption of a particular Web advertising format, the search advertising. Nowadays, this is the leading format and, by 2010, it will represent a market of more than US$11 billion [23], according to Forrester Research pro jections. In search advertising, an advertiser company is given prominent positioning in ad lists in return for a placement fee. Because of this, such methods are called paid placement strategies. The most popular paid placement strategy is a nonintrusive technique called Keyword-targeted advertising [30]. In this technique, keywords extracted from the user's search query are matched against keywords associated with ads provided by advertisers. A ranking of the ads, which also takes into consideration the amount that each advertiser is willing to pay, is computed. The top ranked ads are displayed in the search result page together with the answers for the user's query. The success of keyword-targeted advertising has motivated information gatekeepers to offer their ad services in different contexts. For example, relevant ads could be shown to users directly in the pages of information portals. The motivation is to take advantage of the users immediate information interests at browsing time. The problem of matching ads to a Web page that is browsed, which we refer to as Content-targeted advertising [21], is different from that of keyword-targeted advertising. In this case, instead of dealing with users' keywords, we have to use the contents of a Web page to decide which ads to display. A previous work in literature [28] has shown that the use of different pieces of evidence, such as structural information and the contents of the advertiser's page, can impact on the relevance of the ads selected to be displayed. This Categories and Subject Descriptors H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval; I.5.3 [Pattern Recognition]: Applications--Text processing General Terms Algorithms, Experimentation Keywords Web Advertising, Genetic Programming Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGIR'06, August 6­11, 2006, Seattle, Washington, USA. Copyright 2006 ACM 1-59593-369-7/06/0008 ...$5.00. 549 work, however, did not answer important questions such as how to combine the available pieces of evidence or how much importance should be given to each evidence. This led us to a question: how can we design a ranking strategy for displaying ads according to their relevance by effectively leveraging all the evidence available? Further, given the negative impact of irrelevant ads on credibility and brand of publishers and advertisers, how to design functions that minimize the placement of irrelevant ads, especially when the relevant ones are not available? To give proper answers for these questions, we propose a new approach to content-targeted advertising based on Genetic Programming (GP). GP is a machine learning technique inspired by biological evolution to find solutions optimized for certain problem characteristics. Our assumption is that GP is able to learn the intrinsic characteristics of the content-targeted advertising problem and use them to provide solutions able to improve the ranking effectiveness. To validate our GP method we performed experiments using a real ad collection and web pages extracted from a Brazilian newspaper. The results obtained indicate that GP is able to learn ranking functions that are very effective in placing ads in web pages. In particular, our best function provided a gain over state-of-the-art strategies of approximately 61.7% in average precision. Further, GP was able to learn functions that successfully avoid the placement of irrelevant ads by calculating thresholds based on the page where the ads should be placed. This paper is organized as follows. In Section 2, we provide background information on content-targeted advertising and GP. In Section 3, we describe how we modeled the content-targeted advertising problem using GP. In Section 4, we describe our experiments and report our results. In Section 5, we describe the related work. In Section 6, we present the conclusions. Figure 1: Example of content-based advertising in the page of a company that offers health care jobs. The content of the page is ab out the usage of an identification technology called RFID. On the right side, we can see ads picked for this page by Go ogle's content-targeted advertising system. 2. BACKGROUND In this section we present background information on content-target advertising and review the main concepts in Genetic Programming. 2.1 The Content-Targeted Advertising Problem Content-targeted advertising consists in showing a list of ads in a web page, referred to as the triggering page. The ads are expected to be relevant to the users and suitable and profitable to the publishers and advertisers. Therefore, factors that contribute to the order in which the ads are displayed in the lists are: (i) the relatedness and adequacy of ads to the content of the page and (ii) the amount the advertiser is willing to pay for clicks in their ads. In this work we consider that an ad is composed of three structural parts: a title, a textual description and a hyperlink. In fact, these are the usual components of an ad in search advertising systems. The hyperlink points to a page, called landing page, where a transaction can be started. In this page, the user can also find more information related to the ad or to the company, its products and services. Figure 1 illustrates an ad list with two ad slots on the right side of a web page. For the ad in the first ad slot, the title is "RFID Alternative", the description is "Single contact 1-Wire memory with 64-bit unique serial number", and the hyperlink points to the site "www.maxim-ic.com". Besides the visible parts, a set of keywords K = {k1 , k2 , . . . , km } is associated with each ad. The keywords comprise one or more words and are used by the advertisers to describe which topics should appear in a web page to display the ads on it. For instance, for the first ad shown in Figure 1, the ad keyword could be "RFID" or "RFID alternative". To associate a certain keyword k with one of its ads, the advertiser has to bid on k in an auction type system. The more the advertiser bids on k, greater are the chances that its ads will be shown in the ad list of pages in which topic k is present. Notice that the advertisers will only pay for their bids when the users click on their ads. Further, an advertiser can associate several ads with the same product or service. We refer to such group of ads as a campaign. Notice that only an ad per campaign should be placed in a web page in order to ensure a fair use of the page advertising space and increase the likelihood that the user will find an interesting ad. In this work we are particularly interested in the relevance aspect of the content-targeted advertising problem. Given a web collection D and a set of ads A, our task is to select ads ai A related to the contents of a Web page p D and rank them according to how relevant they are. The ad list is then built in such way that more relevant ads are placed in top positions and, as far as possible, only one ad per campaign is selected. In the following, we formally define this restriction. Let C = {C1 , C2 , ..., Cn } be a partition of A that represents the set of campaigns C1 , C2 , ..., Cn . Let r(ai , p) : A × D R be a function that indicates how relevant is the ad ai to the triggering page p. Let ij p : N × C × D R be a function that represents the relevance score of the i-th topranked ad of campaign Cj according to the function r. For instance, if as is the second top-ranked ad of campaign C5 , 25p = 0.5 indicates that r(as , p) = 0.5. We are interested in finding the function rank(ai , p) : A × D R that can be used to build rank lists that satisfy the constraint: 550 i,j,k|j =k (ij p > 0 (i+1)kp > 0 ij p > (i+1)kp ) ( 1) As previously mentioned, ad placement systems should minimize the possibility of exhibiting irrelevant ads. Misplacements are particularly common in two situations. First, in spite of the ad and the page being related to the same sub ject, their mapping is not appropriate. For example, this is the case of placing ads in pages about catastrophes or unethical and illegal advertising. Second, the triggering page is about a topic for which it is hard to find relevant ads. In order to minimize misplacements in such situations, specially the second, a good ranking function should provide reliable relevance estimations such that it would be possible to distinguish the acceptable relevance levels from the not acceptable. Notice that, in this work, we intend to learn the ranking functions rank(ai , p), through GP. These ranking functions are designed to optimize overall precision and minimize the number of misplacements. 2.2 Genetic Programming Genetic programming (GP) [19] is a set of artificial intelligence search algorithms that follows the principles of biological inheritance and evolution. GP is typically used to approximate complex, non-linear functional relationships [19]. Because of the intrinsic parallel search mechanism and powerful global exploration capability in a high-dimensional space, GP has been used to solve a wide range of hard optimization problems that oftentimes have no best known solutions. The overall GP framework for a setting comprising a training and a validation collection is described in Listing 1. Listing 1: Overall GP Framework. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 Let T be a t r a i n i n g document c o l l e c t i o n ; Let V be a v a l i d a t i o n document c o l l e c t i o n ; Let Ng be the number o f g e n e r a t i o n s ; Let Nt be the number o f i n d i v i d u a l s ; S ; P I n i t i a l random population o f i n d i v i d u a l s ; For each g e n e r a t i o n g o f Ng g e n e r a t i o n s do { For each i n d i v i d u a l i P do fitness i fitness (i, T ) ; Sg Get Nt top-ranked i n d i v i d u a l s o f g e n e r a t i o n g according to t h e i r f i t n e s s ; S S Sg ; P New population c r e a t e d by applying g e n e t i c o p e r a t o r s t o i n d i v i d u a l s i n Sg ; } F ; For each i n d i v i d u a l i S do F F {i, fitness (i, V )} ; B e s t I n d i v i d u a l SelectionMethod ( F , S ) ; evolve generation by generation through genetic operations (lines 7-13). A fitness function is used to assign the fitness value for each individual (line 9). The fitness value indicates how well they perform in the training examples and it can be used as a means of selecting the best ones (line 10). To evolve the best individuals, genetic operators are applied to them with the aim of creating more diverse and better performing individuals (line 12). Examples of such operators are reproduction, mutation, and crossover. The reproduction operator is used to breed new individuals identical to their parents, the crossover operator takes two individuals (parents) to breed a new one that shares some attributes with each parent, and the mutation operator simulates the deviations that occur in the reproduction process. The last step in the GP framework presented in Listing 1 consists in determining the best individual to be applied to the test set. The natural choice is the individual with best performance in the training set. However, it might not generalize well due to overfitting1 during the learning. In order to alleviate this problem the best individuals evolved over Ng generations are applied to a second document collection, which we call a validation collection (line 15). Then it is possible to select the individual that presents good performance in both sets, the training and validation (line 17). It is likely to generalize well since it proved to be a good choice in two different document sets. Therefore, an initial strategy to select the best individual should be to get the one that presents the best average performance in the training and validation sets. However, since the average does not ensure that the selected individual has a balanced performance in the both sets, it would be interesting to consider the standard deviation to correct such a bias. More formally, we apply the following method to deter¯ mine our best individual. Let fi be the average performance of individual i in the training and validation collections, and (fi ) be the corresponding standard deviation. The best individual is given by: ¯ argmax (fi - (fi )) i (2) 3. MODELING CONTENT-TARGETED ADVERTISING WITH GP In order to apply GP to the problem of content-targeted advertising we need to define three key components of the GP framework described in Listing 1: the individuals, the genetic operators and the fitness function. 3.1 Individuals Since we are interested in finding a good ranking function to match ads and pages, as described in Section 2.1, we decided to represent our individual using a tree structure. As observed by [9], a tree based representation allows for easy parsing, implementation, and interpretation. Figure 2 illustrates an individual. As we can see in Figure 2, the non-leaf nodes in the tree structure ("*", "log", and "/") represent functions applied to the terminals in the leaf nodes. The functions addition 1 Situation in which the learner may adjust to very specific random features of the training data such that the performance on the training examples still increases while the performance on unseen data becomes worse. In GP, the solution to a problem is represented as an individual (i.e., a chromosome) in a population pool. These individuals are represented by means of complex data structures such as trees, linked lists, or stacks [20]. The length or size of these data structures is not fixed, although it may be constrained by implementation to be within a certain size limit. Initially, the population starts with individuals created randomly as we can see in Listing 1 (line 6). Then they 551 tf * log / N df tf * log (N / df) Figure 2: A sample tree representation for a function. Here we show the common TF-IDF weighting scheme. (+), multiplication (), division (/) and logarithm (log ) are used in our individual representation. They were selected because they provide meaningful operations on relations. For example, matching functions used in Information Retrieval (IR) commonly employ addition and multiplication to reinforce relations in different degrees, division to accommodate inverse relationships, and logarithm to smooth values. These functions are applied t o t erminals that are t he leaf nodes in the tree structure ("tf ", "N", and "df "), as shown in Figure 2. Since in this work we intend, through GP, to discover a single ranking function to find a set of relevant ads with regard to a Web page by combining all or several of the available evidence, the terminals to be used in our representation comprise the information related to this evidence. In other words, the terminals represent statistics about the structural parts of the ads and the information provided by the advertisers such as the keywords associated with the ads and the content of the landing page. Additionally, we use real numbers as terminals to allow fixed weighted factors. Table 1 describes all the terminals to be used. Notice in this table that P stands for different structural parts of the ads and the information provided by advertisers (keyword, title, description, and landing page), and G indicates whether the ads are grouped. For instance, the feature tf ad ,title stands for the number of times a term appears in the title of an ad whereas the feature tf camp ,title represents the number of times a term appears in the titles of all the ads of a campaign. campaign, a ranking is built according to the similarity function i (lines 3-4). The top ranked ad of each ranking is then selected till that all the campaigns have been considered (lines 6-9). These ads are sorted according to the relevance scores provided by i and inserted into the final ranked list (lines 10-11). The process is repeated until that no ads remain to be selected (line 5). By doing this, we guarantee that the j -th top-ranked ad of a campaign will always be placed into a page above the (j + 1)-th top-ranked ad of any other campaign, satisfying the campaign constraint. The fitness value corresponding to individual i is then obtained through the evaluation of the final ranked list (line 12). Note that depending on the evaluation function to be used we can propose different fitness functions. The evaluation functions and corresponding fiteness functions to be used in this work are discussed in the following paragraphs. Listing 2: Fitness function. 1 2 3 4 5 6 7 8 9 10 11 12 13 function f i t n e s s ( i n d i v i d u a l i , c o l l e c t i o n T ) Let C = {C1 , ..., Cn } be the s e t o f campaigns i n T ; For a l l campaigns Cj C do rlistj Apply i t o Cj ; While e x i s t s j such that |rlistj | > 0 do For j = 1 t o |C | do I f |rlistj | > 0 then adtop e x t r a c t top-ranked ad o f rlistj ; I n s e r t adtop i n t o rlisttemp ; Sort rlisttemp ; I n s e r t ads i n rlisttemp i n t o rlistfinal p r e s e r v i n g t h e i r order ; fvalue Evaluate rlist f inal ; r e t u r n fvalue ; 3.2 Genetic Operators The genetic operators used in our model are those commonly used in GP, that is, mutation, crossover, and reproduction. Notice that, given the representation of our individuals by means of trees, the crossover operator consists in taking two trees and exchanging randomly selected subnodes of these trees forming two new children. Accordingly, the mutation operator was implemented in such a way that a randomly selected subtree is replaced by a new subtree also created randomly. A good ranked list should maximize the placement of relevant ads near to top positions since these are the positions more likely to be clicked by the users [11]. Thus, the evaluation function should take into consideration the number of relevant ads and the order in which they appear, that is, it should be a combination of precision and recall [2], two well-known retrieval measures in IR. An example of such evaluation function is given by: ik =1 r (ai ) × i j =1 pavg @k = r(aj ) ( 3) i 3.3 Fitness Function We now define the fitness function that is the ob jective function GP aims to optimize. The algorithm described in Listing 2 details our fitness evaluation function. We start by noticing that the ranked lists produced by our random individuals do not satisfy the campaign constraint given by Eq. 1. Thus, in order that function f itness (which corresponds to function rank in Section 2.1) can satisfy that constraint, we apply the individual i (which corresponds to function r in Section 2.1) to each campaign in collection T according to a round robin strategy, as follows. For each 1 where = k is a normalizing constant used to ensure that pavg @k fits between 0 and 1, k is the number of ads to be displayed in a page, ai is the i-th top ranked ad, and r(d) {0, 1} is the relevance score assigned to an ad, being 1 if the document is relevant and 0 otherwise. The relevance information is obtained from users. This metric is based on the non-interpolated average precision (PAVG), a measure commonly used in TREC evaluations [15]. The difference between metrics PAVG and pavg @k is the value of the constant , which in PAVG is given by the inverse of the total of relevant documents in 1 the collection. By using = k , we ensure that a ranking function that places relevant ads in all the top positions will receive the maximum pavg @k value equal to 1. In this way, we are able to correctly evaluate functions that suggest a number of ads less than the total of ad slots available. In this work, we refer to the fitness function that uses pavg @k to evaluate its individuals as fpavg @k . 552 Features used tf G,P tf max G,P tf av gG,P tf max colG,P leng thG,P nG,P df ad ,P df max ad ,P df camp ,P df max camp ,P Nad Ncamp N S tatistical meaning Number of times the term appeared in the part P of the ad grouped by G. Maximum tf in the part P of the ad grouped by G. Average tf in the part P of the ad grouped by G. Maximum tf G,P in the entire collection. Number of terms in the part P of the ad grouped by G. Number of distinct terms in the part P of the ad grouped by G. Number of ads in the collection the term appeared in the part P . Maximum df ad ,P . Number of campaigns in the collection the term appeared in the part P . Maximum df camp ,P . Number of ads in the collection. Number of campaigns in the collection. Real constant randomly generated by GP. Table 1: Terminals used in the GP framework for content-targeted advertising Other goal that we want to accomplish with our fitness functions is to reward ranking functions that minimize the placement of irrelevant ads. As mentioned before, these ads should be avoided since they contribute to a negative perception by the users on the credibility and brand of publishers and advertisers. A possible solution to this problem is to consider the ranking values provided by the GP individuals as estimations of how relevant the ads are to the triggering page. By doing so, we can set threshold values to distinguish acceptable relevance levels from non-acceptable ones. Thus, our problem is now finding a matching function that provides reliable estimations in a spectrum in which a threshold value can be set to separate relevant ads from non relevant ones. Our assumption is that GP is able to find such functions. Thus, given a certain threshold level t, we modify our evaluation function such that it rewards individuals that tends to place relevant ads above t and nonrelevant ads below t. Accordingly, it punishes individuals that tends to place irrelevant ads above t and relevant ads below t. Our second evaluation metric is given by: pavg @kt = 1 + k1 rat + k2 nbt pavg @k, 1 + k3 rbt + k4 nat (4) value as vmin . Thus, we have different thresholds for different pages. We refer to the fitness function that uses pavg @kt to evaluate its individuals and calculate thresholds for each page as flocal . A possible disadvantage of flocal is that it tends to suggests, at least, one ad per page. In the second strategy we use the maximum value given to an individual as vmax and the minimum value as vmin . In this case we have only one threshold value for a function. In this work we refer to the fitness function that uses pavg @kt to evaluate its individuals and calculate thresholds for each individual as fglobal . Contrary to flocal , fglobal is more likely to suggest no ads to a certain page. 4. EXPERIMENTS In this section we describe the experiments and present the results. 4.1 Sampling and Data Sets To evaluate our ad placement framework, we used a test collection built from a set of 100 pages extracted from a Brazilian newspaper. These are our triggering pages. They were crawled in such a way that only the contents of their articles were preserved. As we have no preference for particular topics, these pages cover sub jects as diverse as culture, local news, international news, economy, sports, politics, agriculture, cars, children, real estate, computers and internet, TV, travels, and economy. To obtain a set of relevant ads for our test collections, we adopted the same pooling method used to evaluate the TREC Web-based collection [16]. In other words, for each of our 100 triggering pages, we selected the top three ranked ads provided by each of the ten ad placement methods proposed in [28]. These ads were obtained from a real case ad collection composed of 93,972 ads grouped in 2,029 campaigns provided by 1,744 advertisers. With these ads, advertisers associated a total of 68,238 keywords2 . In this collection, only one keyword is associated with each ad. This makes campaigns very important since they are used by the advertisers to associate several keywords with a product or service. As a result of the pooling method, a total of 1,860 distinct ads were selected. They were then inserted into pools corresponding to each triggering pages. Each pool contained an average of 15.81 ads. All the ads were submitted to a manual evaluation by a group of 15 sub jects. Each 2 Data in the portuguese language provided by an on-line ad company that operates in Brazil. where k1 , k3 , k2 , and k4 are the weights associated with the number of relevant ads above (rat ) and below (rbt ) the threshold and non relevant ads below (nbt ) and above (nat ) the threshold, respectively. Notice that in our experiments we give more weight to nat since we want specially to avoid the placement of irrelevant ads in the top positions. In particular, we use k1 = k3 = k2 = 1 and k4 = 2. An important remaining issue is how to define the threshold value t. In this work we define t = vmin +kt (vmax -vmin ), where vmin and vmax are the minimum and maximum values given by the ranking function. The constant kt is the relative position in the spectrum the GP individual should consider a point of low confidence. In our experiments we use kt = 0.3. In other words, our new fitness functions will reward ranking functions in which the minimum score assigned to a relevant ad corresponds to 30% of (vmax - vmin ). Notice that, in fact, it is not possible to know the values of vmin and vmax because we deal with randomly generated functions. As a consequence we define these limits by inspecting the rank values provided by our random individuals. In this study we adopt two different strategies to estimate the limit values. In the first, we use the maximum value given to a certain page as vmax and the minimum 553 subject was asked to evaluate the ads selected to each page according to its relevance to the pages. The average number of relevant ads per page pool was 5.15. Since our experiment can be qualified as a supervised learning task, we follow the three data-sets design [7, 24]. In other words, the 2,337 evaluated pairs of ads and documents resulting of the pooling process were used to built training, test, and validation sets. For this, we randomly split the data into three parts. We used 50 pages (and its corresponding ads) for training, 30 pages for validation, and 20 pages for test. As previously mentioned, the introduction of the validation dataset is to help alleviate the problem of overfitting of GP on the training data and select the best generalizable individual. All the results reported in this work are based on the test data set. Metho d s AAK H GP1 Hits/Suggestions #1 9/20 14/20 #2 5/20 11/20 #3 9/20 7/20 Total 23/60 32/60 pavg @3 Score 0.314 0.508 Gain ­ +61.7% Table 2: Performance comparison b etween the b est individual evolved from the optimization of fpavg @k (GP1) and baseline metho d (AAK H). Columns lab elled #1, #2, and #3 indicate the total of hits and suggestions for the first, second, and third ad slots, resp ectively. 4.4.1 Experiments with exactly three ads per page 4.2 Setup We learned on the training sample using different parameters. We noticed that a small population size and different rates for the genetic operations produce better results. The size of the populations used in our experiment was fixed at 750 individuals. The maximum depth of the tree used to represent an individual was set as 17. In all experiments related here, the populations were created using four different random seeds and were allowed to evolve for 30 generations. This number was determined empirically. The random seeds used were 245, 37383, 322443, and 6758. As in [19], we used crossover, mutation, and reproduction rates of 85%, 10%, and 5%, respectively. We tested our GP framework using the three fitness functions described in Section 3.3. Experiments for each function were run four times using the different random seeds. The best result among the four runs is reported and used for comparison. As we can see in Table 2, our best GP individual (GP1), reached a performance of 50.8% in pavg @3. This corresponds to a gain of 61.7% when compared with our baseline. An interesting characteristic of GP1 is its successful performance in the first ad slot which is the one more likely to be clicked by the users [11]. Figure 3 displays the evolution along 30 generations of the population from which GP1 was selected. For each generation we can see the ten best individuals sorted according to the performance of their fitness function (fpavg @k ). The figure shows a remarkable difference in the performance of the individuals when we compare training and test sets. This is due to overfitting. The individuals applied to the training set tend to learn very specific characteristics not found in the test set. As a consequence, the best individuals of the training set are not so good in the test set. However, by selecting the best individual using Eq. 2, we were able to get a ranking function that generalizes well. 0.65 0.6 0.55 Average Precision 0.5 0.45 0.4 0.35 0.3 0.25 0.2 0 30 60 90 120 150 180 210 240 270 300 Individuals sorted by generation Training Test Validation Baseline 4.3 Evaluation and Baseline We present the results of our experiments considering that a triggering page offers three ad slots. We report figures using pavg @3 (Eq. 3, with k = 3), for the case in which the methods assign exactly three ads per page. For the cases in which they are allowed to assign less than three ads, we use pavg @k (Eq. 3) and pavg @kt (Eq. 4). In all the cases, as in [28], we also report the number of hits and ads suggested per ad slot. We call hit the placement of a relevant ad. We compare the results of our GP ranking functions with those obtained by the AAK H method described in [28]. This method consists in using a cosine similarity function to match the t riggering page to the ad. Besides i ts title and description, the content of the ad, as used by AAK H, includes the content of the keyword and the landing page associated with it. Further, this method requires that all the terms in the ad keyword be present in the triggering page to the ad to be considered a good matching. Amongst the methods presented in [28], which take into account only the ad title, description, keywords, and landing page, AAK H is the best. Given these pieces of evidence, note that, as far as we know, this is the best method found in the literature. This makes AAK H an ideal baseline since our GP individuals make use of the same body of evidence. Figure 3: Evolution Pro cess for 300 individuals in 30 generations. Notice that each ten individuals corresp ond to one generation. 4.4.2 Experiments with possibly less than three ads per page 4.4 Results In this section we present the results of experiments with exactly three ads per page and with possibly less than three ads per page. Table 3 compares the performance of the best individuals obtained by GP that have evolved to avoid placing irrelevant ads according to threshold values. In this table, GP2 is the individual evolved from the optimization of flocal . The line corresponding to this individual shows performance figures for the case in which the threshold value is not taken into 554 consideration. That is, all the top ads selected by GP2 are evaluated independently of their ranking scores. The line started with GP2+thr corresponds to the same individual for the opposite case, that is, the threshold value is taken into consideration. Similarly, GP3 evolved from the optimization of fglobal and its corresponding performance figures are shown for the cases where the threshold was used (GP3+thr) and was not used (GP3). Notice in Table 3 that GP2 and GP3 present better performance than the baseline with gains of 37.2% and 9.6%, respectively, for the pavg @k metric. These results, however, are worse than those obtained with our best individual, GP1. This is due in part to the fact that more precise individuals tend to misplace ads less frequently and, consequently, they have less opportunities to be rewarded by correctly placing irrelevant ads below a certain threshold. When we analyze the performance of the individuals after applying the thresholds, we notice an improvement for GP2+thr and no difference for GP3+thr. For instance, method GP2+thr was able to avoid placing twelve irrelevant ads in the third slot with the loss of only five ads. When considering the metric pavg @kt , the gain of GP2+thr over GP2 was approximately of 16%. This allows us to conclude that GP was able to learn functions that avoid the placement of irrelevant ads and present good overall performance for the case in which different thresholds are obtained for each page. Conversely, for the case in which a unique global threshold has to be used, GP was not able to learn good ranking functions. 17, 22], document clustering and classification [14, 31], and document ranking [13, 27]. From these, many works [6­8, 10] have applied GP to discover ranking functions. For example, success has b een rep orted in applying GP to find ranking functions optimized to specific queries in the information routing task [7]. Similarly, GP has also been successfully used in the ad-hoc retrieval task [10]. In fact, this work is inspired on this prior research in ranking function discovery. But it differs significantly in several important aspects. Since we intend to find ranking functions for contenttargeted advertising, we deal with specific characteristics of this problem not found in classical IR tasks previously studied. For instance, content-targeted advertising presents different kinds of evidence, the possibility of taking advantage of campaign clustering statistics, and specific ranking related issues such as campaign placement restrictions and impact of irrelevant ads. 6. CONCLUSIONS In this paper we proposed and tested a new framework for associating ads with web pages based on GP. In particular, given the importance of relevance for content-target advertising systems, our GP method aimed to learn functions able to select the more relevant ads given the available evidence. By using a real ad collection and web pages from a Brazilian newspaper, we obtained a gain over our baseline method of 61.7%. Further, by evolving individuals to provide good ranking estimations, GP was able to discover ranking functions that are very effective in placing ads in web pages while avoiding the irrelevant ones. In the future we intend to provide more extensive and comprehensive analysis of our models and expand them in order to contemplate additional evidence and consider other important aspects of the content-targeted advertising problem. Regarding model analysis, we intend to study how different threshold tuning strategies impact on the learning and effectiveness of our GP framework. We also plan to perform more extensive comparison of our method with other machine learning techniques, such as the SVM-based approach [8]. Future plans also include a detailed study of why GP ranking functions outperform other techniques in this task. Regarding new models, we intend to evolve functions that take into consideration the category information associated with ads and pages. More important, we plan to expand our models to yield ranking functions that combine the relevance and monetary aspects of the problem by considering the amount the advertiser is willing to pay for the placement of their ads. 5. RELATED WORK The success of search advertising has motivated research in many topics related to targeted advertising. Examples of these studies include the comparison of ranking strategies [11], the characterization of fake traffic in order to detect frauds [5], the proposal of tools for keyword suggestion [3], and the design and implementation of a large-scale targeted advertising system [1]. In particular, the relevance aspect of the ranking strategies has attracted attention. This is not surprising since many works in advertising research have emphasized the importance of relevant associations for consumers [26] and how irrelevant ads can turn off users and relevant ads are more likely to be clicked on [11]. As a result, some works have tried to determine how to take advantage of the available evidence to improve the relevance of the selected ads. For instance, studies on keyword matching showed that the nature and size of the keywords have impact on the likelihood of an ad to be clicked [25]. Relevance is also the focus of the authors in [28] which proposed several strategies for ranking ads in content-targeted advertising. These strategies took into consideration the contents of structural parts of the ad and additional information obtained from web pages other than the triggering page. Examples of these pages are the landing pages or web pages obtained by means of a probabilistic model. They showed that considering the contents of the ad structural parts and external pages can improve the relevance of the selected ads. In contrast to that work, we propose to learn the best ranking strategies in order to effectively leverage all the evidence available while minimizing the placement of irrelevant ads. For this, we use GP. GP has been applied to several IR topics in recent years, such as query induction, representation, and optimization [4, 7. ACKNOWLEDGMENTS This work was supported in part by the GERINDO Pro ject grant MCT/CNPq/CT-INFO 552.087/02-5, by the CNPq grant 30.5237/02-0 (Nivio Ziviani), by the 5S-QV pro ject grant MCT/CNPq/CT-INFO 551013/2005-2, by the CNPq grant 300188/95-1 (Berthier Ribeiro-Neto), by grant 10023UFMG/RTR/PRPQ/RECEM-DOUTORES/04 (sub-grant 32-PROGRAMACAO GENETICA). Marco Cristo is supported by Fucapi, Manaus, AM, Brazil. 555 Methods AAK H GP2 GP2+thr GP3 GP3+thr Hits/Suggestions #1 9/20 10/20 10/20 10/20 10/20 #2 5/20 11/20 10/18 9/20 9/20 #3 9/20 8/20 3/ 3 5/20 5/19 Total 23/60 29/60 23/41 24/60 24/59 pavg @k Score 0.31 0.43 0.49 0.34 0.34 Gain(%) ­ +38.7 +58.1 +9.6 +9.6 pavg @kt Score ­ 1.12 1.30 0.59 0.59 Gain(%) ­ ­ +16.1 ­ 0.0 Table 3: Performance of the b est individuals evolved from the optimization of flocal (GP2) and fglobal (GP3). Columns lab elled #1, #2, and #3 indicate total of hits and suggested ads for the first, second, and third ad slots, resp ectively. Note that the values in gain columns are relative to b oldface values in the corresp onding left columns. 8. REFERENCES [17] [1] G. Attardi, A. Esuli, and M. Simi. Best bets: thousands of queries in search of a client. In Proceedings of the 13th international WWW conference on Alternate track papers & posters, pages 422­423, New York, NY, USA, 2004. ACM Press. [2] R. Baeza-Yates and B. Rib eiro-Neto. Modern Information Retrieval. Addison-Wesley-Longman, 1st edition, 1999. [3] J. J. Carrasco, D. Fain, K. Lang, and L. Zhukov. Clustering of bipartite advertiser-keyword graph. In Workshop on Clustering Large Datasets, 3th IEEE International Conference on Data Mining, Melb ourne, Florida, USA, Novemb er 2003. IEEE Computer So ciety Press. Available at http://research.yahoo.com/publications.xml. [4] O. Cordon, F. Moya, and C. Zarco. A new evolutionary algorithm combining simulated annealing and genetic programming for relevance feedback in fuzzy information retrieval systems. Soft Computing - A Fusion of Foundations, Methodologies and Applications, 6(5):308­319, Aug. 2002. [5] E. Eneva. Detecting invalid clicks in online paid search listings: a problem description for the use of unlab eled data. In T. Fawcett and N. Mishra, editors, Workshop on the Continuum from Labeled to Unlabeled Data, 20th International Conference on Machine Learning, Washington DC, USA, August 2003. AAAI Press. [6] W. Fan, E. A. Fox, P. Pathak, and H. Wu. The effects of fitness functions on genetic programming-based ranking discovery for web search. JASIST, 55(7):628­636, 2004. [7] W. Fan, M. D. Gordon, and P. Pathak. Discovery of context-sp ecific ranking functions for effective information retrieval using genetic programming. TKDE-04, 16(4):523­527, 2004. [8] W. Fan, M. D. Gordon, and P. Pathak. A generic ranking function discovery framework by genetic programming for information retrieval. IPM-04, 40(4):587­602, 2004. [9] W. Fan, M. D. Gordon, and P. Pathak. Genetic programming-based discovery of ranking functions for effective web search. Journal of Management Information Systems, 21(4):37­56, Spring 2005. [10] W. Fan, M. D. Gordon, P. Pathak, W. Xi, and E. A. Fox. Ranking function optimization for effective web search by genetic programming: An empirical study. In Proc. of HICSS-04, pages 105­112, Hawaii, 2004. [11] J. Feng, H. Bhargava, and D. Penno ck. Implementing paid placement in Web search engines: computational evaluation of alternative mechanisms. INFORMS Journal on Computing, 2006. To b e published. [12] D. Gleich and L. Zhukov. SVD based term suggestion and ranking system. In Proceedings of the 4th IEEE International Conference on Data Mining, pages 391­394, Brighton, UK, Novemb er 2004. IEEE Computer So ciety. [13] M. Gordon. Probabilistic and genetic algorithms for do cument retrieval. Communications of the ACM, 31(10):1208­1218, 1988. [14] M. D. Gordon. User-based do cument clustering by redescribing sub ject descriptions with a genetic algorithm. JASIS, 42(5):311­322, 1991. [15] D. K. Harman. Overview of the fourth text retrieval conference TREC-4. In D. K. Harman, editor, Proceedings of the Fourth Text REtrieval Conference (TREC-4), pages 1­24, Gaithersburg, Maryland, USA, November 1996. NIST Sp ecial Publication 500-236. [16] D. Hawking, N. Craswell, and P. B. Thistlewaite. Overview of TREC-7 very large collection track. In The Seventh Text [18] [19] [20] [21] [22] [23] [24] [25] [26] [27] [28] [29] [30] [31] REtrieval Conference (TREC-7), pages 91­104, Gaithersburg, Maryland, USA, November 1998. J.-T. Horng and C.-C. Yeh. Applying genetic algorithms to query optimization in do cument retrieval. Inf. Process. Manage., 36(5):737­759, 2000. IAB and PricewaterhouseCo opers. IAB internet advertising revenue rep ort, April 2005. Available at http://www.iab.net/2004adrevenues. J. R. Koza. Genetic programming: On the programming of computers by natural selection. MIT Press, Cambridge, 1992. W. B. Langdon. Data Structures and Genetic Programming: Genetic Programming + Data Structures = Automatic Programming! Kluwer, Boston, 1998. K. Lee. The SEM content conundrum. ClickZ Exp erts, July 2003. Available at http: //www.clickz.com/experts/search/strat/article.php/2233821 . C. Lop ez-Pujalte, V. P. Guerrero-Bote, and F. de Moya-Anegon. Order-based fitness functions for genetic algorithms applied to relevance feedback. J. Am. Soc. Inf. Sci. Technol., 54(2):152­160, 2003. K. Maddox. Forrester rep orts advertising shift to online, May 2005. Available at http://www.btobonline.com/article.cms?articleId=24191. T. M. Mitchell. Machine learning. McGraw Hill, New York, US, 1996. OneUpWeb. How keyword length affects conversion rates, January 2005. Available at http://www.oneupweb.com/landing/keywordstudy_landing.htm. J. Parsons, K. Gallagher, and K. D. Foster. Messages in the medium: An experimental investigation of Web Advertising effectiveness and attitudes toward Web content. In J. Ralph H. Sprague, editor, Proceedings of the 33rd Hawaii International Conference on System Sciences-Volume 6, page 6050, Washington, DC, USA, 2000. IEEE Computer So ciety. P. Pathak, M. Gordon, and W. Fan. Effective information retrieval using genetic algorithms based matching function adaptation. In Proceedings of the 33rd Hawaii International Conference on System Science, Hawaii, USA, 2000. B. Rib eiro-neto, M. Cristo, E. S. de Moura, and P. B. Golgher. Imp edance coupling in content-target advertising. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 496­500, Salvador, Bahia, Brazil, July 2005. M. Weideman. Ethical issues on content distribution to digital consumers via paid placement as opp osed to website visibility in search engine results. In The 17th ETHICOMP, pages 904­915. Troubador Publishing Ltd, April 2004. M. Weideman and T. Haig-Smith. An investigation into search engines as a form of targeted advert delivery. In Proceedings of the 2002 annual research conference of the South African institute of computer scientists and information technologists on Enablement through technology, pages 258­258. South African Institute for Computer Scientists and Information Technologists, 2002. B. Zhang, Y. Chen, W. Fan, E. A. Fox, M. Gonalves, M. Cristo, and P. Calado. Intelligent gp fusion from multiple sources for text classification. In CIKM '05: Proceedings of the 14th ACM international conference on Information and know ledge management, pages 477­484, New York, NY, USA, 2005. ACM Press. 556