SIGIR 2007 Proceedings Session 16: Learning to Rank II A Combined Component Approach for Finding Collection-Adapted Ranking Functions based on Genetic Programming Humber to Mossri de Almeida1 Marco Cristo2 1 Marcos Andre Goncalves1 ´ ¸ Pavel Calado3 ´ IST/INESC-ID Lisboa, Por tugal pavel.calado@tagus.ist.utl.pt 3 Federal Univ. of Minas Gerais Dept. of Computer Science Belo Horizonte, Brazil 2 FUCAPI - Analysis, Research and Tech. Innovation Center Manaus, Brazil marco.cristo@fucapi.br {hmossri, mgoncalv}@dcc.ufmg.br ABSTRACT In this pap er, we prop ose a new method to discover collectionadapted ranking functions based on Genetic Programming (GP). Our Combined Component Approach (CCA) is based on the combination of several term-weighting comp onents (i.e., term frequency, collection frequency, normalization) extracted from well-known ranking functions. In contrast to related work, the GP terminals in our CCA are not based on simple statistical information of a document collection, but on meaningful, effective, and proven comp onents. Exp erimental results show that our approach was able to outp erform standard TF-IDF, BM25 and another GP-based approach in two different collections. CCA obtained improvements in mean average precision up to 40.87% for the TREC-8 collection, and 24.85% for the WBR99 collection (a large Brazilian Web collection), over the baseline functions. The CCA evolution process also was able to reduce the overtraining, commonly found in machine learning methods, esp ecially genetic programming, and to converge faster than the other GP-based approach used for comparison. Categories and Sub ject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval; I.2.6 [Artificial Intelligence]: Learning General Terms: Algorithms, Measurement, Exp erimentation. Keywords: Information Retrieval, Ranking Functions, Term-weighting, Genetic Programming, Machine Learning. 1. INTRODUCTION The growth in volume of the Web and other textual rep ositories, such as digital libraries, throughout the last decade, Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGIR'07, July 23­27, 2007, Amsterdam, The Netherlands. Copyright 2007 ACM 978-1-59593-597-7/07/0007 ...$5.00. has made the information retrieval task difficult, costly, and in many cases, very complex for the end user. In this context, search engines b ecame valuable tools to help users find content relevant to their information needs. Naturally, research on information retrieval models that can effectively rank search results according to document relevance has b ecome a fundamental sub ject. Information Retrieval models have come a long way. Although the most p opular is still undoubtedly the vector space model prop osed by Salton [19], many new or complementary alternatives have b een prop osed, such as the Probabilistic Model [16]. From all these models, document ranking formulas can b e derived for document searching. Thus, many alternatives exist on how to comp ose a ranking function. Most of them have a common characteristic: they attempt to b e very general in nature, i.e., they were designed to b e applied in any typ e of collection. The work of Zob el and Moffat [26], for example, presented more than one million p ossibilities to compute a similarity function. However, after all the exp eriments, they concluded that no weighting scheme is consistently good in all collections. That is, a ranking function can have success in one domain but fail in another. Further, they comment that it would b e prohibitive to discover the b est weighting scheme simply by an exhaustive exploration of the similarity space. In this work, we discover sp ecialized ranking strategies for sp ecific collections. Our method is able to consider the imp ortant and unique characteristics of each collection so that the discovered function is more effective than any general solution. To accomplish this, we use Genetic Programming (GP), a machine learning technique inspired by Darwinian evolutionary processes, to discover sp ecific ranking functions for each document collection. GP has b een successful in many IR problems [7­10, 13, 22]. GP was chosen due to its ability to find any arbitrary function, even when dealing with very large search spaces. However, differently from other GP-based approaches, which use only basic statistical information from terms and documents, our strategy uses rich, meaningful, and proven effective comp onents present in well-known ranking formulas, such as Okapi BM25 [18] and Pivoted TF-IDF [21]. Our assumption is that by providing these human-discovered formula comp onents as building blocks, the GP process can take advantage of all the human 399 SIGIR 2007 Proceedings Session 16: Learning to Rank II knowledge that has b een applied to produce them. As a consequence, it will b e able to b etter explore the search space. To validate our GP approach we p erformed exp eriments with the TREC-8 and WBR99 collections. Results indicate that the use of meaningful comp onents in a GP-based framework leads to effective ranking functions that significantly outp erform the baselines (standard TF-IDF, BM25 and another GP-based approach [9]). Our Combined Comp onent Approach (CCA) ranking functions also converged to good results faster than the GP approach used as baseline, and the overtraining also was reduced. This pap er is organized as follows. In Section 2, we provide background information on term-weighting comp onents and genetic programming. In Section 3, we present our Combined Comp onent Approach for similarity calculation. The collections used and exp erimental results are detailed in Section 4. In Section 5, we describ e related work. Finally, Section 6 concludes the pap er and gives suggestions for future work. calculating the similarity b etween documents and queries, demonstrating that the space of p ossibilities for customizing and refining ranking functions is extremely large. This has stimulated the application of effective search space exploration techniques, such as GP [11], for discovering collectionadapted similarity functions. 2.2 Genetic Programming Genetic Programming (GP), an inductive learning technique introduced by Koza in [11] as an extension to Genetic Algorithms (GA), is a problem-solving system inspired by the idea of Natural Selection. The search space of a problem, i.e., the space of all p ossible solutions to the problem, is investigated using a set of optimization techniques that imitate the theory of evolution, combining natural selection and genetic op erations to provide a way to search for the fittest solution. The evolution process starts with an initial p opulation comp osed by a set of individuals. Generally, the initial p opulation is generated randomly. Each individual denotes a solution to the examined problem and is represented by a tree. To each individual is associated a fitness value. This value is determined by an evaluation function, also known as fitness function. The fitness value indicates goodness of an individual and it is used to eliminate from the p opulations all "unfit" individuals, selecting only those that are closest to the desired goal. The individuals will evolve generation by generation through genetic op erations such as reproduction, crossover, and mutation. The reproduction op erator simply breeds a new individual. The mutation op erator simulates the deviations that take place in the reproduction process. Finally, the crossover op erator generates new individuals by the comp osite of some characteristics present in two other individuals (the parents ). Thus, for each generation, after the genetic op erations are applied, a new p opulation replaces the current one. The fitness value is measured for each new individual, and the process is rep eated over many generations until the termination criterion has b een satisfied. This criterion can b e a preestablished maximum numb er of generations or some additional problem-sp ecific success predicate to b e reached (e.g., an intended value of fitness for a sp ecific individual). 2. BACKGROUND In this section we present the term-weighting comp onents used in our approach and a brief review of some concepts of Genetic Programming. 2.1 Term-Weighting Components In [20], Salton and Buckley present a sp ecification for the main function of a term-weighting system. Generally, a typical term-weighting formula is defined as b eing comp osed of two comp onent triples: tf cq , cf cq , ncq , which represents the weight of a term in a user query q , and tf cd , cf cd , ncd , which represents the weight of a term in a document d. The term frequency comp onent (tfc ) represents how many times a term occurs in a document or query. The collection frequency comp onent (cfc ) considers the numb er of documents in which a term app ears. Low frequencies indicate that a term is unusual and thus more imp ortant to distinguish documents. Finally, the normalization comp onent (nc ) tries to comp ensate for the differences existing among the document lengths. Typical term-weighting formulas combine these three comp onents. For instance, as in [20], we can define wtd =tf cd × cf cd × ncd , and wtq =tf cq × cf cq × ncq , where wtd is the weight of term t in document d and wtq is the weight of term t in query q . A common definition for some of these comp onents is tf cd as the raw term frequency of term t in document d, cf cd as the inverse document frequency (idf ) of term t (usually defined as cf cd = log (N/nt ), where N is the total numb er of documents and nt is the numb er of documents where term t occurs), and ncd as the inverse of the size of document d. This is usually called a TF-IDF (term frequency­inverse document frequency) weighting scheme. We can express a ranking function based on such a termweighting system as follows: t sim(q , d) = wtd × wtq (1) q 3. COMBINED COMPONENT APPROACH Our Combined Component Approach (CCA) is a GP-based approach for discovering good ranking formulas. Our goal is to discover new ranking functions more adapted to the sp ecificities of a particular collection. As mentioned b efore, differently from other GP approaches, our idea consists in examining imp ortant information retrieval ranking formulas from systems such as [1, 4, 18] and extracting from their ranking schemes comp onents such as those describ ed in Section 2.1. These comp onents can b e entire formulas or some parts of a ranking formula. Once they are identified, we can use them as building blocks of our GP approach and combine them to generate new ranking functions. where sim(q , d) is the similarity measure b etween a query q and a document d. Ten years after Salton and Buckley's prop osal, the work of Zob el and Moffat [26] explored this taxonomy further by adding eight different typ es of weighting functions. Their approach leads to more than 1,500,000 combinations for 3.1 CCA Modeling for Ranking Function Discovery using Genetic Programming As in Fan et al. [7], we use a tree data structure to represent a term weighting formula. This tree-based representation allows for easy parsing, implementation and interpretation. Figure 1 illustrates an example individual representing 400 SIGIR 2007 Proceedings Session 16: Learning to Rank II a TF-IDF formula. The leaf nodes in such trees are called terminals, and represent the basic information units that will b e comp osed to create the final formula. Terminals are combined through functions which are represented in the internal nodes. In previous works [7­10, 22], terminals always reflect basic statistics directly derived from the collection, such as term frequency or document size. Our work differs from this approach, as explained b elow. Figure 2 displays another individual representing the TFIDF weighting scheme. In this case, idf information is itself a terminal and not the result of a combination of terminals. In previous approaches, this information is a subtree that must b e explicitly discovered by the GP evolutionary process. Thus, our CCA approach takes advantage of using information previously known to b e effective in other ranking formulas, such as idf , pivoted normalization and others, allowing for a more effectively oriented exploration of the search space. Listing 1: GP Framework used by CCA 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 Let T be a t r a i n i n g s e t o f q u e r i e s ; Let V be a v a l i d a t i o n s e t o f q u e r i e s ; Let Ng be the number o f g e n e r a t i o n s ; Let Nb be the number o f bes t i n d i v i d u a l s ; P I n i t i a l random population o f i n d i v i d u a l s ; Bt ; For each gener ati on g o f Ng g e n e r a t i o n s do { Ft ; For each i n d i v i d u a l i P do Ft Ft {g , i, fitness (i, T )} ; Bt Bt getBestIndividuals (Nb , Ft ) ; P applyGeneticOperations (P , Ft , Bt , g ) ; } Bv ; For each i n d i v i d u a l i Bt do Bv Bv {i, fitness (i, V )} ; Bes tIndi vi dual applySelectionMethod ( Bt , Bv ) ; function is used but on the validation set of queries and documents. Only the individuals that p erform the b est in this phase are selected as the final solutions. This process is describ ed in Listing 1. Terminals and Functions Figure 1: A sample tree for a TF-IDF individual based on statistical information An individual is represented by terminals and functions, organized in a tree structure, as shown in Figure 2. Terminals contain information extracted from the main ranking formulas published on IR literature. Table 1 describ es all the terminals to b e used. Besides these, we also use constant values in the range [0..100]. As functions, we use addition (+), multiplication (), division (/) and logarithm (log). Genetic Operators Based on Koza [11], we use the genetic op erators of reproduction, crossover and mutation as detailed in Section 2.2. Figure 2: A sample tree for a TF-IDF individual based on our Combined Component Approach Fitness Function Since our individuals represent term-weighting schemes to b e used in a document ranking function, our fitness function must measure the quality of the ranking generated by a given individual. We exp erimented with two different functions: (1) non-interp olated average precision over all relevant documents (PAVG) as describ ed in [5, 24], and (2) function FFP4 as defined in [5], a utility function based on the idea that the utility of a relevant document decreases with its ranking order. GP Framework The GP framework is basically an iterative process with two phases: training and validation. For each phase, we select a set of queries and documents from the collection, which we call the training set and the validation set. The framework starts with the creation of an initial random p opulation of individuals that evolves generation by generation using genetic op erations (reproduction, crossover, and mutation). The process continues until a stopping criterion is met. In the training phase, each time a new generation is created, the fitness function is applied to each new individual, to select only the fittest. Since each individual represents a weighting scheme, applying this fitness function corresp onds to ranking the set of training documents according to the set of training queries, using the individual's weighting scheme. The obtained fitness value is simply a quality assessment of the generated ranking. After the last generation is created, to avoid selecting individuals that work well in the training set but do not generalize for different queries/documents (the overfitting problem), the validation phase is applied. In this case, the fitness Selection of the Best Individual As mentioned b efore, we use a validation set to help in choosing good solutions that are not over-sp ecialized for the training queries, i.e. that are able to generalize for unseen queries. In [12], the choice of the b est individual is accomplished by considering the average p erformance of an individual in b oth the training and validation sets minus the standard deviation value. We call this method AVG . The individual with the highest value of AVG will b e selected as the b est. Despite b eing a balanced approach, our exp eriments have shown that this method may not lead to the b est p erformance in some runs. We therefore prop ose a similar method, which also considers the disp ersal b etween training and validation values, but uses the sum of these values in place of 401 SIGIR 2007 Proceedings Session 16: Learning to Rank II Table 1: Terminals used by our CCA approach in the GP framework Id t01 t02 t03 t04 t05 t06 t07 t08 t09 t10 t11 t12 Terminal tf 1 + log(tf ) 0.5 + 0.5+tf max tf 1+log(tf ) 1+log(avg tf ) (k1 +1)tf (k1 ((1-b)+b×dl/avg dl)+dl)+tf N N log log log Description Raw term frequency (number of times a term occurs in a document) [20] Natural logarithm of term frequency factor as presented in [25] to smooth the influence of term frequency Term-frequency factor normalized by maximum tf in a document, and further normalized to lie between 0.5 and 1.0 [2, 20] Term-frequency factor normalized by average tf in a document as defined in [21]. Part of SMART weighting scheme formula as defined in [4] Part of Okapi BM25 ranking formula with term frequency tfc and normalization nc components [17] I nverse document frequency (idf ) [20] A n alternative to inverse document frequency (idf ) as presented in [25] variation of the Robertson-Sparck Jones weight [16] A R d f d f N +1 -d +0.5 f 0.5 w (1) = log N -d f log d f log N +0.5 d f N -d +0.5 f d +0.5 f obertson-Sparck Jones weight [16­18] probabilistic inverse collection frequency [16, 20] A log N +1 i 21 2 wi,j Part of INQUERY belief function to calculate the belief in a term within a document [1] 2 Cosine normalization where wi,j is t01 × t07 [20, 25] 2 Cosine normalization where wi,j is t02 × t07 [20, 25] =0 t13 i 21 2 wi,j =0 t14 t15 t16 t17 t18 t19 t20 dl 1 avgt (1-slope)+slope× t 13 13 Document length (in bytes) normalization. A normalization technique used in degraded text collections (e.g. OCR-based repositories) [21] Pivoted cosine normalization with t13 terminal as defined in [21] Pivoted byte size normalization as defined in [21] Pivoted unique normalization [21], where pivot is the average of the number of unique terms in a document Normalization factor present in Okapi BM25 scheme, as defined in [17, 18] Query factor present in Okapi BM25 scheme, as defined in [17, 18] Augmented normalized query term frequency, as defined in [2, 20] 1 (1-slope)×avg dl+slope×dl 1 (1-slope)×piv ot+slope×# of unique terms k 1 dl 1 ×(1-b)+b× avgdl + tf (k3 +1)×q tf k3 + q t f 0.5 + 0.5+q tf max q tf the average. We call this method SUM . More formally, let ti b e the training p erformance of an individual i, let vi b e the validation p erformance of this individual, and let i b e the corresp onding standard deviation. The b est individual is selected by: argmax ((ti + vi ) - i ) i (2) The exp eriments describ ed in this pap er in Section 4 were p erformed with b oth AVG and SUM methods. 4. EXPERIMENTS In this section we describ e our exp eriments and present the results obtained. 4.1 Data Sets In our exp eriments, we use the TREC-8 [24] and WBR99 collections to evaluate our approach. The TREC-8 collection has roughly 528K documents, and 737K distinct terms. Our exp eriments were p erformed using 50 topics numb ered from 401 to 450. For each topic, there is a set of relevant documents that can b e used for testing a new ranking formula. The queries were automatically generated using the title, description and narrative of each topic. We divide the queries into three groups: topics 401-420 were used for the training phase, topics 421-430 were used for the validation phase, and topics 431-450 were used for the test phase. The WBR99 collection1 , which has b een previously used in works such as [15], contains a database of Web pages, a set of queries and a set of relevant documents associated with each query. The relevant documents were generated through a manual evaluation of ranked documents retrieved by a query processor based on the vector space model. The collection has almost 6M Web pages, crawled from the Brazilian web (the .br domain), and almost 2.7M distinct terms. It represents a considerably connected snapshot of the Brazilian Web community, which is probably as diverse in content and link structure as the entire Web. Thus, we b elieve it makes a realistic testb ed for our exp eriments. Queries 1-20 1 Available at http://www.linguateca.pt/Rep ositorio/WBR99/ 402 SIGIR 2007 Proceedings Session 16: Learning to Rank II were used for the training phase, queries 21-30 for validation, and queries 31-50, except query 35, for the test phase. All the results rep orted in Section 4.4 for b oth collections are based on the test queries. at 9 document cutoff values, and R-precision. We also plot all retrieval results in precision-recall curves. 4.4 Results In this section we present the results of our exp eriments. The b est results were obtained by ranking functions discovered using the SUM method to select the b est individuals, and maximum tree depths of 5 and 8 for the TREC-8 and WBR99 collections, resp ectively. We rep ort the b est individuals for TREC-8 considering their global p erformance. For WBR99, we considered the b est p erformance overall and in the top of the ranking, an imp ortant asp ect for a Web collection3 . The b est ranking functions discovered by CCA for the TREC-8 and WBR99 collections are shown in Figures 3 and Figure 4, resp ectively. (* (* (log t08 ) (+ t05 t07 )) (+ (+ (* (+ t19 t05 ) (+ t07 t06 )) (* (+ t06 t02 ) (* t16 t18 ))) (/ t07 t19 )) 4.2 GP Parameters and Setup Almost all the parameters and configurations were tuned based on Fan et al.'s work [9]. An initial p opulation of 200 individuals was created randomly using the ramped half-andhalf method. We exp erimented with PAVG and FFP4 fitness functions, as describ ed in Section 3.1, for the TREC-8 and WBR99 collections. PAVG was more effective for the TREC-8 collection, whereas FFP4 was the b est choice for WBR99. Thus, we rep ort only the results obtained with PAVG for TREC-8 and with FFP4 for WBR99. Due to the stability of the results after 30 generations, we defined this value as the termination criterion. We used crossover, reproduction and mutation rates of 90%, 5%, and 5%, resp ectively. The random seed used was 1234567890. At the end of each generation, the validation phase was run for the top 20 b est individuals discovered on the training phase of that generation. The terminals used were those defined by our CCA approach in Table 1 plus real constant numb ers generated randomly. We used the addition (+), multiplication (*), division (/) and protected logarithm2 (log) functions to combine terminals and subtrees of an individual. The maximum depth of the generated trees ranged from 3 to 12. We exp erimented with each value of maximum depth to analyze its influence on the quality of the discovered ranking functions. ) Figure 3: CCA discovered ranking function for the TREC-8 collection (+ (+ (+ (+ (* (* (* (* (+ (* (* (* ) 99.09 t11 ) (* t07 t10 ) t05 (* (+ (* t07 t10 ) (+ t08 t10 )) (* t12 t01 )))) (* t07 t10 ) t05 (* (+ (* t02 t04 ) (+ t08 t10 )) (* t12 t01 )))))) t12 t01 ) (* t07 t10 ) t05 (* (+ (/ t08 t20 ) (+ t08 t10 )) (* t12 t01 ))))) 4.3 Evaluation and Baselines We compared the retrieval results of our approach with three others: (i) Okapi BM25, as describ ed in [17] (using parameters k1 =1.2, k2 =0, k3 =1000, b=0.75), (ii) vector space model with standard TF-IDF weighting scheme, and (iii) the GP approach prop osed by Fan et al. in [9]. Due to space restrictions, we rep ort only the two b est methods p er collection -- BM25 and FAN-GP for TREC-8, and TF-IDF and FAN-GP for WBR99. TF-IDF outp erformed BM25 for the WBR99 collection, probably due to the way that the relevance information was obtained. Despite the fact that Fan et al.'s original work had not varied the tree depth, we allowed the depth to vary b etween 3 and 12, as in our approach, in order to ensure a fair comparison. The remaining GP parameters were set as defined in that work, except the mutation factor, which was set to 5%. Although Fan et al. ran the exp eriments without mutation, our exp eriments showed that this op eration leads to b etter results. We chose the b est ranking function discovered by Fan et al.'s approach (henceforth named FAN-GP) to use as a baseline. This selection of the b est ranking function was done as describ ed in Section 3.1, due to the AVG and SUM selection methods leading to ranking functions at least as good as the b est validation individuals. To evaluate the p erformance of our approach against the baselines, we used, as in [24], the (non-interp olated) average precision measure over all relevant documents, the precision 2 The protected logarithm function, used to prevent numeric overflow, returns zero if its argument is zero and otherwise returns the natural logarithm of the absolute value of its argument [11]. Figure 4: CCA discovered ranking function for the WBR99 collection Each discovered ranking function had one terminal related to term frequency comp onent for the queries -- t19 or t20 -- and at least three different terminals related to collection frequency -- t06 , t07 , t08 , t09 , t10 , or t11 . Figure 3 shows parts of the BM25 ranking function, t05 , t18 , and t19 . Figure 4 presents general TF-IDF comp onents such as t01 , t02 , t07 , and t12 . This shows that CCA was able to reuse good comp onents of the b est baseline function for each collection and to combine them with others to generate b etter ranking functions. Figure 5 displays the evolution process after 30 generations. For each generation, we evaluate the b est 20 individuals sorted according to their p erformance, calculated through the fitness function. As we can see, CCA ranking functions converge faster than FAN-GP. The figures also show that the CCA curves b ehave very similarly. Despite the fact that training, validation, and test sets present different fitness values, validation and test curves tend to follow the training b ehavior. In the FAN-GP curve, test and validation curves do not follow the training b ehavior, which suggests overfitting. The gap b etween the training curve and the others also indicates overfitting -- the greater the gap, the higher the overfitting. As we can see, the CCA curves are also closer than FAN-GP. 3 In fact, contrary to TREC-8, the b est global individual in WBR99 was not the b est on the top of the ranking. 403 SIGIR 2007 Proceedings Session 16: Learning to Rank II Figure 5: TREC-8 and WBR99 evolution processes for the best 20 individuals in 30 generations Figure 6: TREC-8 and WBR99 precision-recall curves Figure 6 shows the 11-p oint average precision figures for CCA and the baselines on TREC-8 and WBR99 collections. We notice that CCA yields b etter precision values than TFIDF, BM25 and FAN-GP throughout almost all recall levels. Table 2 presents detailed average precision figures at different ranking levels. As we can see, CCA yields b etter precision than the BM25 and FAN-GP approaches for TREC-8 and b etter than TF-IDF and FAN-GP for the WBR99 collection. For the TREC-8 collection, CCA reached an average precision of 16.40%, corresp onding to gains of 40.87% over BM25 and 14.00% over FAN-GP. For the WBR99 collection, CCA achieved an average precision of 16.68%, corresp onding to gains of 21.67% over TF-IDF and 24.85% over FAN-GP. We also observe that FAN-GP did not surpass the TF-IDF results for the WBR99 collection. All the CCA p erformance improvements related to average precision were statistically significant at p < 0.1 for TREC-8 and p < 0.05 for WBR99. Table 2 shows the confidence levels (1-p) obtained by the pair-wise t-tests. We also observe in Table 2 that, for the top 5 documents, CCA improves BM25 and FAN-GP by 18.52% on TREC-8, and improves TF-IDF and FAN-GP by 59.10% and 2.94% on WBR99. Our CCA approach also attains improvements in R-precision of more than 26% over BM25 and 15% over FAN-GP for TREC-8, and more than 8% over TF-IDF and 12% over FAN-GP for WBR99. Figure 7 presents the average precision for each query. For the TREC-8 collection, we observe that our CCA results improve the p erformance over FAN-GP in some topics where it had no gain over BM25, such as topics 438, 443, 445, 449, and 450. Figure 7 also shows that CCA's ranking functions only present worse p erformance than BM25 and FAN-GP in topics where mean average precision is already very low, such as 432, 435, 437, 440, and 448. For the WBR99 collection, CCA obtains b etter results than the baselines for the queries 31, 32, 33, 34, 44, 47, 48, 49, and 50. CCA stands next to the b est results for the queries 39, 40, and 46. Our approach does not surpass FAN-GP for the queries 37, 39, 41, 42, 43, and 45. Considering the exp eriment with the depth that produced our b est individual for each collection, more than 30 different individuals discovered by CCA surpassed the baselines for TREC-8 and WBR99, which indicates that CCA is able to find several good ranking functions. 5. RELATED WORK Different approaches to discover ranking functions based on machine learning techniques, such as genetic algorithms and genetic programming, have b een prop osed in the literature. Fan et al. [7] prop osed a new approach to automatically generate term weighting strategies for different contexts, based on genetic programming (GP). They argue that each sp ecific context demands a different term weighting strategy, that is, a ranking function should b e adapted to different document collections and users. In their works [8­10] they have demonstrated that GP has b een effective at improving the p erformance of information retrieval tasks. In [5], they also study the effect of different utility functions, i.e., functions that privilege the retrieval of relevant documents on the top of the ranking, when used as fitness functions. In all these works, terminals are based on basic statistical information of the collection, documents and queries, such as term frequencies (tf ), total numb er of documents, length in bytes of a document, numb er of unique terms in a document, etc. In [8], Fan et al. compare 404 SIGIR 2007 Proceedings Session 16: Learning to Rank II Table 2: TREC-8 and WBR99 document level average figures for CCA TREC-8 Level At 5 At 10 At 15 At 20 At 30 At 100 At 200 At 500 At 1000 do cs do cs do cs do cs do cs do cs do cs do cs do cs WBR99 CCA Baselines Gain over FAN-GP +18.52% +23.53% +9.46% +9.89% +3.91% +5.94% +2.12% +6.84% +10.99% +15.51% +14.00% 98.19% TF-IDF 23.157 26.315 30.175 32.105 25.263 13.421 9.000 3.726 1.968 20.607 13.710 FAN-GP 35.790 32.632 29.474 26.842 21.579 10.737 7.553 3.516 1.879 19.830 13.361 Precision 36.842 32.632 29.123 27.368 25.965 13.632 8.790 3.737 2.005 22.294 16.681 Baselines BM25 27.000 29.000 25.333 23.750 20.167 15.050 11.900 7.410 4.910 16.116 11.643 FAN-GP 27.000 25.500 24.667 22.750 21.333 15.150 11.775 7.160 4.505 17.703 14.388 Precision 32.000 31.500 27.000 25.000 22.167 16.050 12.025 7.650 5.000 20.449 16.402 CCA Gain over TF-IDF +59.10% +24.00% -3.49% -14.75% +2.78% +1.57% -2.34% +0.28% +1.87% +8.19% +21.67% 96.96% Gain over FAN-GP +2.94% 0.00% -1.19% +1.96% +20.33% +26.96% +16.38% +6.29% +6.73% +12.43% +24.85% 96.04% Gain over BM25 +18.52% +8.62% +6.58% +5.26% +9.92% +6.64% +1.05% +3.24% +1.83% +26.89% +40.87% 94.05% R-precision Average Confidence Level Figure 7: TREC-8 and WBR99 query-by-query comparisons FAN-GP to a learning approach based on neural networks. Their results showed that FAN-GP overcome substantially the neural network strategy. In [10], FAN-GP exploits structural information of Web documents. Their approach was compared to one based on Supp ort Vector Machines (SVM). FAN-GP manages to improve SVM for b oth ad hoc and routing tasks in retrieval. The work in [22] also presents a GP approach, based on statistical information of the collection, documents, and queries. Additionally, and unlike the works of Fan et al., this work adds the baseline functions, such as those in [18, 20, 21] as individuals in the initial p opulation. According to the author, this addition guarantees that the worst p ossible p erformance during the training phase is at least as good as the b est baseline of the functions added. The fact that Trotman has chosen to represent individuals (including the baselines) with simple statistical information means that he could not guarantee the integrity of the baseline comp onents, since these could b e very distorted or completely destroyed by the evolutionary process. In fact, the b est individuals rep orted by Trotman did not contain comp onents of the baselines. CCA instead guarantees the preservation of these good building blocks and the knowledge they carry, having (indirectly) outp erformed Trotman's approach. When applied to a sp ecific collection (FBIS), Trotman was able to achieve an improvement of 32.90% in MAP. This, however, was describ ed as an atypical result. The average gain for all tested collections in Trotman's b est run was around 8%, with negative gains for some collections. Our results surpassed BM25 by 40.87% for TREC-8, suggesting a substantial improvement over Trotman's approach. Finally, different from CCA, in which we advocate a collection-based approach, Trotman has the goal of finding ranking functions good for all colections. In [13], Oren explores several tf -idf functions generated by a GP approach. His work uses statistical information of the collection, idf , and two instances of tf as terminals. However, his improvements over a basic tf -idf strategy were not significant. In contrast to all of these aforementioned works, our approach uses parts of well-known, significant, and proven effective ranking formulas as terminals, for representing termweighting comp onents, instead of simple statistical information. As mentioned b efore, our hyp othesis is that providing richer comp onents for GP to work with will allow the discovery of b etter final ranking formulas. In fact, our approach proved to b e more stable and consistent in generating effective ranking functions than previous work, which used only basic statistical information. Other works that are also somewhat related to ours have focused on the combination of results from different ranking formulas. In [3], Bartell et al. show that retrieval effectiveness can b e improved significantly by a combination of the results of a numb er of different retrieval algorithms (or experts ), since these emphasize different asp ects of the documents to b e retrieved. Pathak et al. [14] introduces a method of utilizing Genetic Algorithms in Information Retrieval tasks. Sp ecifically, it shows how GA can b e used to adapt several matching functions that are used to match documents descriptions with query descriptions. Vogt et al. [23] prop ose a linear combination model for fusion of ranking results. This model 405 SIGIR 2007 Proceedings Session 16: Learning to Rank II combines the result lists of multiple IR systems by scoring each document using a weighted sum of the scores from each of the comp onent systems. The work in [6] also uses Genetic Algorithms to combine different exp ert matching functions. The weights associated with combinations are evolved using GA. Our approach differs from these, since we do not combine ranking formulas or result lists, but comp onents or p ortions of ranking functions from different retrieval systems to generate a completely new ranking formula. 6. CONCLUSIONS AND FUTURE WORK In this pap er we have introduced a new approach based on the combination of term weighting comp onents, extracted from well-known information retrieval ranking formulas, using genetic programming. We show that our Combined Component Approach (CCA) improves the retrieval p erformance compared to standard TF-IDF, BM25 and other GP-based approaches [9, 22] that use only basic statistical information derived from collections and documents. We used the TREC-8 and WBR99 collections to validate our approach. Using the TREC-8 collection, our exp eriments showed that CCA leads to significant improvements in retrieval effectiveness. Mean average precision was improved by 40.87% and 14.00% over BM25 and FAN-GP resp ectively. R-precision and precision at the top of the ranking improvements were also observed. For the WBR99 collection, we obtained improvements of 21.67% and 24.85% in mean average precision over TF-IDF and FAN-GP, resp ectively. Examining the CCA evolution process, we observed that CCA ranking functions converged faster than FAN-GP. Our approach also reduced overfitting. Results obtained by CCA lead us to conclude that the use of meaningful terminals instead of simple statistical information improves the quality of the process of discovering ranking functions with GP. For future work, we will investigate ranking formulas from other IR models such as the Set-based Model [15] to extract new terminals. We will utilize the structural information within documents to compare our approach to others, such as [10], for Web search. We also will investigate the influence of the mutation factor, tree depth, and p opulation length on the discovery of good ranking functions, considering the use of meaningful terminals and statistical information. Finally, but not less imp ortant, we also intend to examine closely the discovered b est ranking functions to understand b etter how they work and the reasons for their effectiveness. 7. ACKNOWLEDGMENTS Sp ecial thanks go to Prof. Edward Fox for his very helpful comments. This work is partially supp orted by pro jects GERINDO (CNPq/CT-INFO 552.087 /02-5), 5S-VQ (CNPq /CT-INFO 55.1013/2005-2), FCT pro ject ref. POSC/EIA/ 58194/2004, GRICES/CNPq bilateral coop eration pro ject "ADAPTINF", by grant 10023 - UFMG/RTR/PRPQ/RECEM-DOUTORES/04 (sub-grant 32 - PROGRAMACAO GENETICA), and by an individual grant from CNPq to Marcos Andr´ Gon¸alves. e c 8. REFERENCES [1] J. Allan, J. P. Callan, F. Feng, and D. Malin. INQUERY and TREC-8. In Proceedings of TREC-8, pages 637­644, Gaithersburg, MD, 1999. NIST Sp ecial Publication 500-246. [2] R. Baeza-Yates and B. Rib eiro-Neto. Modern Information Retrieval. Addison-Wesley-Longman, Boston, MA, 1999. [3] B. T. Bartell, G. W. Cottrell, and R. K. Belew. Automatic combination of multiple ranked retrieval systems. In Proceedings of the 17th ACM SIGIR, pages 173­181, 1994. [4] C. Buckley, A. Singhal, and M. Mitra. New retrieval approaches using smart: TREC 4. In Proceedings of TREC-4, pages 25­48, Gaithersburg, MD, 1996. NIST Sp ecial Publication 500-236. [5] W. Fan, E. A. Fox, P. Pathak, and H. Wu. The effects of fitness functions on genetic programming-based ranking discovery for web search. Journal of the American Society for Information Science and Technology, 55(7):628­636, 2004. [6] W. Fan, M. Gordon, and P. Pathak. On linear mixture of exp ert approaches to information retrieval. Decision Support Systems, 42(2):975­987, 2006. [7] W. Fan, M. D. Gordon, and P. Pathak. Personalization of search engine services for effective retrieval and knowledge management. In Proceedings of the 21st Intern. Conf. on Inf. Systems, pages 20­34, Brisbane, Australia, 2000. [8] W. Fan, M. D. Gordon, and P. Pathak. Discovery of context-sp ecific ranking functions for effective information retrieval using genetic programming. IEEE Transactions on Know ledge and Data Engineering, 16(4):523­527, 2004. [9] W. Fan, M. D. Gordon, and P. Pathak. A generic ranking function discovery framework by genetic programming for information retrieval. Information Processing and Management, 40(4):587­602, 2004. [10] W. Fan, M. D. Gordon, and P. Pathak. Genetic programming-based discovery of ranking functions for effective web search. Journal of Manag. Inf. Syst., 21(4):37­56, 2005. [11] J. R. Koza. Genetic Programming: On the programming of computers by natural selection. MIT Press, Cambridge, 1992. [12] A. Lacerda, M. Cristo, M. A. Goncalves, W. Fan, N. Ziviani, and B. Rib eiro-Neto. Learning to advertise. In Proceedings of the 29th ACM SIGIR, pages 549-556, 2006. [13] N. Oren. Reexamining tf.idf based information retrieval with genetic programming. In Proceedings of the SAICSIT 2002 Conference, pages 224­234, 2002. [14] P. Pathak, M. Gordon, and W. Fan. Effective information retrieval using genetic algorithms based matching functions adaptation. In Proceedings of the 33rd HICSS, Hawaii, 2000. [15] B. P^ssas, N. Ziviani, J. Wagner Meira, and B. Rib eiro-Neto. o Set-based vector mo del: An efficient approach for correlation-based ranking. ACM TOIS, 23(4):397­429, 2005. [16] S. E. Rob ertson and K. S. Jones. Relevance weighting of search terms. Journal of the American Society for Information Science, 27(3):129­146, 1976. [17] S. E. Rob ertson and S. Walker. Okapi/keenb ow at TREC-8. In Proceedings of TREC-8, pages 151­162, Gaithersburg, MD, 1999. NIST Sp ecial Publication 500-246. [18] S. E. Rob ertson, S. Walker, S. Jones, M. M. Hanco ck-Beaulieu, and M. Gatford. Okapi at TREC-3. In Proceedings of TREC-3, pages 109­126, Gaithersburg, MD, 1995. NIST Sp ecial Publication 500-226. [19] G. Salton. The SMART retrieval system - Experiments in automatic document processing. Prentice Hall Inc., Upp er Saddle River, NJ, 1971. [20] G. Salton and C. Buckley. Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513­523, 1988. [21] A. Singhal, C. Buckley, and M. Mitra. Pivoted do cument length normalization. In Proceedings of the 19th ACM SIGIR, pages 21­29, 1996. [22] A. Trotman. Learning to rank. Information Retrieval, 8(3):359­381, 2005. [23] C. C. Vogt and G. W. Cottrell. Fusion via a linear combination of scores. Information Retrieval, 1(3):151­173, 1999. [24] E. M. Vo orhees and D. Harman. Overview of the eighth Text REtrieval Conference (TREC-8). In Proceedings of TREC-8, pages 1­24, Gaithersburg, MD, 1999. NIST Sp ec.Publ. 500-246. [25] I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann Publishers, San Francisco, CA, 1999. [26] J. Zob el and A. Moffat. Exploring the similarity space. SIGIR Forum, 32(1):453­490, 1998. 406