SIGIR 2007 Proceedings Session 19: Multi-Lingual IR

Building Simulated Queries for Known-Item Topics: An Analysis using Six European Languages

Leif Azzopardi, Dept. of Computing Science, University of Glasgow, Glasgow G12 8QQ, leif@dcs.gla.ac.uk
Maarten de Rijke, ISLA, University of Amsterdam, Kruislaan 403, 1098 SJ Amsterdam, mdr@science.uva.nl
Krisztian Balog, ISLA, University of Amsterdam, Kruislaan 403, 1098 SJ Amsterdam, kbalog@science.uva.nl

ABSTRACT

There has been increased interest in the use of simulated queries for evaluation and estimation purposes in Information Retrieval. However, there are still many unaddressed issues regarding their usage and impact on evaluation, because their quality, in terms of retrieval performance, is unlike that of real queries. In this paper, we focus on methods for building simulated known-item topics and explore their quality against real known-item topics. Using existing generation models as our starting point, we explore factors which may influence the generation of the known-item topic. Informed by this detailed analysis (on six European languages) we propose a model with improved document and term selection properties, showing that simulated known-item topics can be generated that are comparable to real known-item topics. This is a significant step towards validating the potential usefulness of simulated queries: for evaluation purposes, and because building models of querying behavior provides a deeper insight into the querying process, so that better retrieval mechanisms can be developed to support the user.

Categories and Subject Descriptors
H.3.4 [Information Storage and Retrieval]: Systems and Software--performance evaluation

General Terms
Experimentation

Keywords
Query simulation, query generation, evaluation, multilingual retrieval

1. INTRODUCTION

Evaluation plays a central role in Information Retrieval because it enables the benchmarking and comparison of different retrieval techniques [17].
SIGIR'07, July 23-27, 2007, Amsterdam, The Netherlands. Copyright 2007 ACM 978-1-59593-597-7/07/0007.

However, manually building test collections for evaluation is a time consuming and expensive process. With more and more collections becoming available and only finite resources to spend, the range of tasks that can be evaluated is restricted. A cost-effective alternative to manually building test collections is to construct simulated (or artificial) test collections instead [1, 11, 15, 16]. While this involves a number of compromises regarding the realism of the generated test collection, the solution has many benefits. Simulation provides an inexpensive avenue for testing, training, and evaluating retrieval algorithms, along with the ability to precisely control the experimental conditions. For instance, the length (i.e., long vs. short), style (highly discriminative terms vs. popular terms), quality (noisy query terms, translations, etc.), and number of queries produced for a given topic can be greatly varied to define a specific scenario. This enables selective evaluation of particular query types. By considering different query types, the relationship between query characteristics and algorithm performance can be better understood, which helps guide the development of retrieval algorithms and query performance prediction [12]. Recently, there have been a number of studies which have constructed a simulated test collection for evaluation and training purposes [1, 11, 12].
However, little research has been performed on investigating their validity, utility and limitations for evaluation. Consequently, there are many open questions and issues concerning their use. What is the information need? How should topics be generated? Can the models of query generation produce reasonable or interpretable queries? Are these queries really like user queries? Do they give the same indication of performance, or the same ranking of systems? And what about relevance judgments? So while using simulated test collections is cost-effective and potentially useful in the context of evaluation, they are not well understood [10]. The focus of this study is on one particular type of simulated test collection, built from the automatic generation of known-item topics [1, 11]. The approach generates known-item topics by selecting a document, the known item, and producing a query for that known item. Since the task of known-item finding has a clear and precise semantics (i.e., find the known item), this removes issues relating to acquiring user judgements or defining an information need (as it is implicit in the task). The main concern for these test collections is the production of the known-item topics and whether they are representative or reflective of actual known items. The challenge is to develop models that produce known-item topics that are comparable to manually created known-item topics. Previous studies using simulated topics have resulted in varying levels of performance. A very early study found that simulated queries for ad hoc retrieval performed very poorly compared to real queries [15, 16]. However, more recent studies have shown mixed performances from simulated queries. Azzopardi and de Rijke [1] report that performance for known-item finding queries is reasonably similar to real queries, but either somewhat lower or somewhat higher.
During WebCLEF 2006, simulated queries were also used for the known-item task in over 20 languages [2]. Simulated topics resulted in substantially poorer performance than manual topics for many of the languages. As a result, only a weak correlation between the rankings of systems using simulated and real queries was found [2]. In sum, current models for generating simulated topics do not appear to provide performance comparable to manual topics. The cause of this problem, we believe, is that the models used to generate the queries and topics are not a good representation of the actual querying processes. In this paper, we examine the problems of developing useful simulated queries for known-item finding, and attempt to identify generation methods that produce topics that are comparable to real topics. On six different European language collections, we test current models of known-item topic generation. Our main contribution consists of two parts: (1) a detailed analysis of a number of factors that impact the performance of the generated queries; and (2) extensions and refinements of current query models in terms of term selection and non-uniform document priors. This results in query models whose replicative validity can be established. The remainder of this paper is structured as follows. In the next section, we briefly review the related work in this area before describing a generative probabilistic approach to query generation for known-item finding tasks. Then, in Section 4 we perform an empirical analysis on the WebCLEF multilingual web retrieval tracks using different query models. We analyse and revise the models in Sections 5 and 6. Finally, we discuss our findings and conclude in Section 7.

2. RELATED WORK

Simulated queries have been used in the past for a variety of applications, including parameter estimation, sampling, and simulated topics.
In [15] we find an early attempt to generate not only topics, but also the collection, based on the distributions of real collections and queries. A generation model is proposed to construct documents and topics (queries and relevance judgments) for ad hoc retrieval. To determine whether their simulated model was comparable to the real system, they tried to validate the model. According to Zeigler [18], there are three kinds of validation that can be performed on a simulation: predictive, structural and replicative. A model has predictive validity if it can produce the same data output as the real system (e.g., comparing the query terms for a given known item from the simulated model and the real system). A model has structural validity if the way it operates is a reflection of how the real system operates. And a model has replicative validity if it produces output that is similar to the output of the real system. Tague et al. [15] focus on replicative validity, comparing the performance of the simulated queries to the performance of the real queries and testing whether they were drawn from the same distribution. However, their simulated collection and topics resulted in very poor performance that was not comparable to real topics. Consequently, they were unable to produce a model with replicative validity. In this paper, we adopt their methodology for testing whether the performance of the simulated topics is comparable to the performance of real queries; see Section 4. Berger and Lafferty [4] proposed a sophisticated language model (the translation model) which required significant training to set the model parameters effectively. Query and document pairs were required as training data, and queries were created to obtain these pairs: the title of a document was taken to be one example of a query for that document. These pairs were then used for the estimation process.
Callan and Connell [7] proposed Query Based Sampling, which uses random queries to probe collections. Queries consisted of a single term drawn from a sample population of documents according to different selection strategies, such as term frequency, document frequency, inverse document frequency, or average term frequency. The documents returned in response to the query were used to form a representation of the collection. It was believed that random queries, while unrealistic, would lead to an unbiased representation of the resource; thus, there was no requirement for the queries to be realistic. In [11], a synthetic test collection for annotation retrieval is created by generating queries for a particular document by randomly sampling terms without replacement. Then, a corresponding annotation is generated by again randomly sampling terms without replacement from the document. The idea was to create query-annotation pairs where the queries did not match the annotation, so that this context could be evaluated. Jordan et al. [12] generated simulated queries, which they refer to as "controlled" queries, to estimate parameters in pseudo-relevance feedback. They generated queries from a set of pseudo-relevant documents using a number of different selection strategies: (1) single-term highly discriminative queries; (2) two-term queries composed of the most discriminative term together with a second term selected at random; and (3) varied-length queries, where the terms with the highest relative entropy (i.e., the highest term discrimination) were selected. The relative entropy was computed with respect to the previously selected terms (i.e., term dependence). This process is analogous to query expansion, where extra query terms are selected to add to the query; for instance, Cai et al. [5] used the most discriminative terms from top-ranked documents. Azzopardi and de Rijke [1] focus on generating queries for known-item finding.
Here, the querying behavior of users searching for a known item is formalized within a generative probabilistic framework. They generated known-item topics on the TREC Enterprise collection and showed that generated queries with more discriminative terms performed better than queries whose terms were randomly selected or selected proportional to term frequency. However, in absolute numbers the performance of the generated queries was lower than the performance of manual queries. A similar result was found at WebCLEF 2006 [2]. This disparity has prompted this study into the reasons why generated queries were not as successful, and how to improve the current models of query generation. We adopt the framework set out in [1] for studying query generation because it enables many different models to be devised, of varying sophistication, while still being intuitive and simple. We describe the generative process in more detail in the following section. It has been acknowledged in the literature that the simulated queries produced are far from realistic. However, since the cost of building test collections is very high, this motivates addressing this problem, because one of the main advantages of simulated queries is that numerous queries can be produced at minimal cost, along with the ability to generate a multitude of different querying behaviors [1, 12]. This has huge ramifications for evaluating systems because it provides finer-grained control over the experimental conditions than would otherwise be possible in a real environment. Further, by focusing on building models of query generation we can develop a better understanding of how users query a system, and then develop mechanisms which help to address the users' information retrieval tasks.
Such query models are also useful for identifying structure and context within the query [6], estimating the difficulty of the query for performance prediction [8], automatically completing queries by predicting subsequent terms [3], and automatic query expansion [5].

3. SIMULATED KNOWN-ITEM QUERIES

In [1], a probabilistic framework is presented for the generation of simulated queries for the task of known-item search. This approach models the following behavior of a known-item searcher. It is assumed that the user wants to retrieve a particular document that they have seen before in the collection, because some need has arisen calling for this document. The user then tries to re-construct or recall terms, phrases and features that would help identify this document, which they pose as a query. The basic algorithm for generating known-item queries is based on an abstraction of the actual querying process, where the following steps are undertaken:

· Initialize an empty query q = {}.
· Select the document dk to be the known item with probability p(dk).
· Select the query length s with probability p(s).
· Repeat s times:
  · Select a term ti from the document model of dk with probability p(ti | θ_dk).
  · Add ti to the query q.
· Record dk and q to define the known-item/query pair.

Since the known item is the relevant document, there is no need for any explicit relevance judgments. The benefits are immediately obvious: by repeatedly performing this algorithm, numerous queries can be generated quickly and inexpensively. But before this can be done, the probability distributions p(dk), p(s), and p(ti | θ_dk) need to be defined. These distributions are an important part of the generative process, as they characterize the behavior of the user which we are trying to simulate. By using different probability distributions, various types and styles of queries can be generated. The distribution with the biggest influence appears to be the definition of the user's language model of the document, p(ti | dk), from which the query terms are sampled. Formally, the user's querying model p(· | θ_dk) can be expressed as shown in Eq. 1, where θ_dk is the model of the user's querying behavior for the document dk. The process is a mixture between sampling from the document and sampling from the collection (or noise):

    p(ti | θ_dk) = (1 - λ) · p(ti | dk) + λ · p(ti)    (1)

Given this user querying model, the quality of the generated query can be directly influenced by varying the λ parameter. As λ tends to zero, the user's recollection of the original document improves. Conversely, as λ tends to one, the user's memory of the document degrades. If λ = 1, then the user knows the document exists, but has no idea which terms appear in it (and selects query terms at random). To provide different types of user querying behaviors, then, it is important to define the probability distribution for selecting the term ti from the document dk, i.e., p(ti | dk). The three broad sampling strategies that have been used to characterize this selection process are: (1) popular, (2) random, and (3) discriminative [1, 7, 12]. Popular selection considers the use of information such as term frequency, location, etc. which makes the term stand out in some way to the user, so that the user recalls this feature. We capture popularity by assuming that more frequent terms are more likely to be used as query terms, and use the empirical probability of a term in the document to represent this selection strategy:

    p(ti | dk) = n(ti, dk) / Σ_{tj ∈ dk} n(tj, dk),    (2)

where n(ti, dk) is the number of occurrences of ti in dk. Random selection makes the assumption that the user will indiscriminately recall terms in the document.
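The generation algorithm above can be sketched in code. This is a minimal illustration under our own assumptions, not the authors' implementation: we use a uniform document prior for p(dk), popular (term-frequency) selection for p(ti | dk), and Knuth's method for the Poisson length draw; all function and variable names are ours.

```python
import math
import random
from collections import Counter

def poisson(mean, rng):
    # Knuth's method for drawing a Poisson-distributed query length s ~ p(s).
    limit, k, p = math.exp(-mean), 0, 1.0
    while p > limit:
        k += 1
        p *= rng.random()
    return k - 1

def popular_model(terms):
    # p(t | d_k) by popular selection: the term's relative frequency.
    counts = Counter(terms)
    total = sum(counts.values())
    return {t: n / total for t, n in counts.items()}

def sample(dist, rng):
    # Draw one term from a {term: probability} distribution.
    terms = list(dist)
    return rng.choices(terms, weights=[dist[t] for t in terms])[0]

def generate_known_item_topic(docs, lam=0.2, mean_len=4, rng=None):
    """Sketch of the generator: select the known item d_k with a uniform
    prior, draw the query length s, then sample each term from the mixture
    (1 - lam) * p(t | d_k) + lam * p(t) of Eq. 1."""
    rng = rng or random.Random(0)
    collection_model = popular_model([t for d in docs.values() for t in d])  # p(t)
    dk = rng.choice(sorted(docs))                   # uniform p(d_k)
    s = max(1, poisson(mean_len, rng))              # query length
    doc_model = popular_model(docs[dk])             # p(t | d_k)
    query = [sample(collection_model if rng.random() < lam else doc_model, rng)
             for _ in range(s)]
    return dk, query

# Tiny toy collection (illustrative only).
docs = {"d1": "budget report treasury annual report".split(),
        "d2": "passport application travel form form".split()}
known_item, query = generate_known_item_topic(docs, rng=random.Random(7))
```

Repeated calls yield as many known-item/query pairs as desired, which is the main appeal of the approach: topics are cheap to produce and the noise level λ is fully under experimental control.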
Admittedly, this does not seem like a very likely strategy for a user issuing a query, but it provides a baseline against which to compare the other selection strategies, as it is the most naive. The probability distribution for the random selection strategy is set as:

    p(ti | dk) = b(ti, dk) / Σ_{tj ∈ dk} b(tj, dk),    (3)

where b(ti, dk) denotes the presence of a term in a document: b(ti, dk) = 1 if ti occurs in dk, and zero otherwise. Discriminative selection assumes that the user will consider information outside the document, and that they will consider the document with respect to the collection. That is, the user may try to select query terms that will discriminate the document they want from the other documents in the collection. Selection of this kind is generally based on the informativeness of a term; the probability distribution we define to represent this strategy is:

    p(ti | dk) = (b(ti, dk) / p(ti)) / Σ_{tj ∈ dk} (b(tj, dk) / p(tj)),    (4)

where p(tj) is the probability of the term occurring in the collection, defined by the maximum likelihood estimate (i.e., p(ti | dk) is proportional to the inverse collection frequency of the term).

Table 1: Basic statistics of manual queries.

    Lang.  Docs.    Terms      Qrys.  Query length
                                      Min.  Max.  Avg.
    ES     33,772   1,399,309  143    1     13    6.01
    UK     66,345   1,768,353  68     2     12    5.14
    NL     148,040  887,819    83     1     9     3.49
    PT     146,563  576,126    56     2     17    5.98
    HU     324,961  537,220    44     1     9     3.50
    DE     438,481  873,631    70     1     7     3.12

4. EXPERIMENTAL SET-UP

How well do the query generation models introduced in Section 3 work? More specifically, can we establish replicative validity for them? Our goal is to better understand the process of query generation through an extensive study of factors influencing the quality of the generated known-item topics. To this end we generate known-item topics using multiple document collections, in six European languages.
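The three user query models evaluated in the experiments below (popular, random and discriminative sampling; Eqs. 2-4) can be sketched as term distributions over a document. This is an illustrative sketch with our own function names, assuming a simple bag-of-words representation:

```python
from collections import Counter

def popular(doc):
    # Eq. 2: p(t|d_k) proportional to the term's frequency in the document.
    tf = Counter(doc)
    total = sum(tf.values())
    return {t: n / total for t, n in tf.items()}

def uniform_random(doc):
    # Eq. 3: every distinct term of the document is equally likely.
    terms = set(doc)
    return {t: 1.0 / len(terms) for t in terms}

def discriminative(doc, collection_tf):
    # Eq. 4: p(t|d_k) proportional to 1 / p(t), the inverse of the term's
    # maximum-likelihood probability in the whole collection.
    total = sum(collection_tf.values())
    weight = {t: total / collection_tf[t] for t in set(doc)}
    z = sum(weight.values())
    return {t: w / z for t, w in weight.items()}

# Toy example: "the" is common in the collection, "eurogov" is rare.
collection_tf = Counter({"the": 1000, "report": 50, "eurogov": 2})
doc = ["the", "the", "the", "report", "eurogov"]
```

On this toy document, popular selection concentrates mass on "the", while discriminative selection concentrates mass on the rare term "eurogov"; this is exactly the contrast the experiments probe.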
After describing the collections, we describe the manual topics used for comparison, the test method used for the comparison, and the results. In Section 5 we follow with a detailed analysis of the results.

Data. The document collections we used are from the EuroGOV corpus [13], a multilingual web corpus built from a crawl of European government-related sites. It contains over 3.5 million pages from 27 primary domains, covering over twenty languages; no single language dominates the corpus. From this corpus, we selected only those languages which had a sufficiently large number (over 40) of known-item topics for the corresponding domain and language. This was to ensure that we had enough examples to be able to observe statistically significant differences. This restriction resulted in using the following languages and domains: German (DE), Spanish (ES), Hungarian (HU), Dutch (NL), Portuguese (PT), and English (UK). Each domain from the EuroGOV collection was individually indexed, using language-specific stopword lists, but we did not apply stemming. Table 1 shows the collection statistics, along with the number of manual known-item topics (Qrys.) and the minimum, maximum, and average query lengths with stopwords removed.

Known-Item Topics. We used the (manually created) known-item queries from WebCLEF 2005 and 2006 [2, 14] as a reference point. We produced a set of simulated queries for each of the six languages (DE, ES, HU, NL, PT, UK) from three user query models: (Model 1) popular sampling (i.e., Eq. 2); (Model 2) random sampling (i.e., Eq. 3); and (Model 3) discriminative sampling (i.e., Eq. 4). The amount of noise for all models was λ = 0.2, which reflects the average amount of noise within the manual queries. The lengths of the generated queries were drawn from a Poisson distribution, with the mean set to the average query length (rounded to the nearest whole number) for the particular language. In all, 100 known-item query-document pairs were generated for each model and each language.

Validation of Query Models. In order to establish replicative validity of a query model, we need to determine whether the queries generated from the model are representative of the corresponding manual queries. Tague and Nelson [16] validated whether the performance of their generated queries was similar to that of real queries across the points of the precision-recall graph using the Kolmogorov-Smirnov (KS) test. Here, we compare the Mean Reciprocal Rank (MRR) of each model against the MRR of the manual queries using the KS test as a way to validate the query models. The KS test is an independent two-sample test of the hypothesis that the two samples may reasonably be assumed to come from the same distribution. So if the two distributions of MRRs are not significantly different, we can say that the query model produces known-item topics which are comparable to manual queries (in terms of performance). Otherwise, if they are significantly different, the query model produces known-item topics which are not comparable, resulting in performance which is significantly lower or higher.

Results. Table 2 presents the performance of the generated queries and the manual queries on three popular, but different, retrieval models: TF.IDF, OKAPI BM25, and a Language Model using Bayes smoothing with a Dirichlet prior set to 2000. An asterisk denotes that the performance is comparable to the manual queries (i.e., not significantly different); otherwise the performance is not comparable (i.e., significantly different), according to a two-tailed KS test with the significance level set to 5%.

Table 2: Mean Reciprocal Rank: on manual and simulated queries.

    Lang.  Query Type  TF.IDF   OKAPI    LM
    ES     Manual      0.2217   0.3102   0.2482
           Model 1     0.1805*  0.2383*  0.2195*
           Model 2     0.1742*  0.2733*  0.2331*
           Model 3     0.3109   0.3675   0.3636
    UK     Manual      0.3548   0.4836   0.4232
           Model 1     0.3047*  0.4741*  0.4475*
           Model 2     0.1941   0.4440   0.3381
           Model 3     0.6359   0.6778   0.6711
    NL     Manual      0.4158   0.6133   0.4842
           Model 1     0.1136   0.1763   0.1729
           Model 2     0.0885   0.1910   0.1458
           Model 3     0.2703   0.2940   0.3119
    PT     Manual      0.0866   0.2161   0.1362
           Model 1     0.0763   0.1165   0.0951
           Model 2     0.0380   0.0620   0.0420
           Model 3     0.1540*  0.2011*  0.1868*
    HU     Manual      0.2812   0.3683   0.2754
           Model 1     0.0148   0.0086   0.0064
           Model 2     0.0173   0.0346   0.0405
           Model 3     0.0338   0.0315   0.0321
    DE     Manual      0.3231   0.5038   0.3588
           Model 1     0.0258   0.0464   0.0473
           Model 2     0.0402   0.0714   0.0484
           Model 3     0.0286   0.0274   0.0274

On inspection of the results, we see that Model 3 generates queries which obtain higher MRR scores than the other models. This is to be expected, as Model 3 should generate queries with more discriminative terms. With respect to retrieval models, OKAPI consistently performed the best (or close to the best) regardless of query model or language. The absolute scores for the simulated queries for ES, UK, PT, and to some extent NL, are all in the same general range as the absolute scores for the manual queries. However, there is a quite clear difference in performance between manual and simulated queries for DE and HU. When we consider whether the performance of the simulated queries is comparable to the manual queries, we find that Models 1 and 2 for ES, Model 1 for UK, and Model 3 for PT are not significantly different (regardless of retrieval model). However, for all the other languages and models the performance is significantly different.
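The validation procedure just used (per-topic reciprocal ranks compared with a two-sample KS test) can be sketched in a self-contained way. This is an illustrative sketch, not the paper's code: the function names are ours, and the 1.36 · sqrt((n + m) / (n · m)) threshold is the standard large-sample approximation of the 5% critical value (in practice a statistics library would supply an exact p-value).

```python
import math

def reciprocal_rank(ranked_ids, known_item, cutoff=1000):
    # MRR ingredient: 1/rank of the known item, 0 if outside the top cutoff.
    for rank, doc_id in enumerate(ranked_ids[:cutoff], start=1):
        if doc_id == known_item:
            return 1.0 / rank
    return 0.0

def ks_two_sample(xs, ys):
    # Two-sample Kolmogorov-Smirnov statistic: the maximum gap between
    # the two empirical CDFs, evaluated at every observed value.
    def ecdf(sample, v):
        return sum(1 for s in sample if s <= v) / len(sample)
    return max(abs(ecdf(xs, v) - ecdf(ys, v)) for v in list(xs) + list(ys))

def comparable(simulated_rr, manual_rr, coeff=1.36):
    # Accept "same distribution" at the (approximate) 5% level if
    # D <= 1.36 * sqrt((n + m) / (n * m)).
    n, m = len(simulated_rr), len(manual_rr)
    d = ks_two_sample(simulated_rr, manual_rr)
    return d <= coeff * math.sqrt((n + m) / (n * m))
```

A query model whose reciprocal-rank sample passes this test against the manual sample is, in the paper's terminology, replicatively valid for that collection.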
And so, for NL, HU and DE none of the query models generated known-item topics that are comparable to manual topics. Put differently, so far we have not established replicative validity for all query models on all languages. We have established that sp ecific models can achieve replicative validity on particular languages using particular query models. Table 3: Mean Reciprocal Rank: On simulated queries with reduced collection size. Lang. Query Type TF.IDF OKAPI LM Mo del 1 0.0380 0.0654 0.0562 DE Mo del 2 0.0527 0.0550 0.0626 1/3 Mo del 3 0.1022 0.1039 0.1129 Mo del 1 0.0401 0.0418 0.0427 DE Mo del 2 0.0361 0.0683 0.0800 1/6 Mo del 3 0.0967 0.1147 0.1299 Mo del 1 0.0473 0.0624 0.0538 HU Mo del 2 0.0502 0.0702 0.0552 1/3 Mo del 3 0.0904 0.0915 0.0933 Mo del 1 0.0324 0.0572 0.0594 HU Mo del 2 0.0505 0.0504 0.0565 1/6 Mo del 3 0.0599 0.0746 0.0754 these occurred only in a handful of documents). Obviously, more index terms available for selection when submitting a query will enable the selection of terms that will b e more likely to retrieve the document, however, in manual queries in HU, DE, and NL all p erform relatively well. The impact of the size of the vocabulary does not seem to affect their p erformance as much. So instead of vocabulary size, we examine the document frequency of the terms in the collection and those in the queries. 5. ANALYSIS In this section we provide a detailed analysis of the results obtained in Section 4. The most striking finding in Section 4 is that the p erformance of the simulated queries is generally quite lower than the manual queries (i.e., for NL, HU and DE, and for Model 1 and 2 for PT) when the p erformance is not comparable (the exceptions are model 3 on ES and UK, which result in considerably b etter p erformance). In most cases the MRR scores were extremely low and resulted from many of the simulated known-item topics ending in complete failure (i.e., not retrieving the known item within the top 1000 documents). 
Consequently, the prop osed models do not always produce queries which p erform like manual queries. Human users must b e generating topics in a different manner such that they are more likely to b e successful. We have examined several related factors which app ear to affect the quality of the queries which may influence the way in which users construct successful queries: the size of the collection, the size of the vocabulary, the document frequency of terms in the collection, the document frequency of query terms, and the imp ortance of a document (as estimated by its inlink count). Note, while this is not an exhaustive list of factors, it represents a reasonable set to investigate initially. Below, we consider each in turn. DF of Terms in Collection. When we consider the discriminative p ower of terms (according to document frequency (DF)) in each collection, we notice that the HU and DE collections contained terms which occur in many more documents than the other languages. Table 4 shows the DF of the n-th term, ordered by DF. Submitting a term from the top 10,000 would result in over 870 and 1,635 documents b eing returned for DE and HU, resp ectively, whereas for ES and UK, the numb er of documents is smaller by almost an order of magnitude. Obviously, this is related to the numb er of documents in the collections, but suggests it is more difficult to select query terms that have sufficient discriminative p ower in these collections compared to others. Terms in the HU and DE collections app ear in so many more documents that using any such term will return many documents (which will most probably lower the MRR). And so, terms from these languages are less discriminative than UK or ES terms, and taking into account the size of the vocabulary this means they also have fewer of them. Hence, the selection of query terms in such cases needs to b e highly discriminative and p opular, in order to identify the known item with high MRR. Collection Size. 
First, as the document collections increase in size, the quality of the generated queries appears to decrease. Intuitively, it is more difficult to pinpoint the document in the larger collections (i.e., HU and DE). To test whether collection size has a large impact on performance, we performed further experiments using 1/3rd and 1/6th of the DE and HU collections in order to reduce their size. While small improvements are obtained (see Table 3), the results are still not comparable (i.e., simulated queries remain significantly different from manual ones). So we believe some other factor is likely to have a greater influence.

Vocabulary Size. If we consider the size of the vocabulary, it is much smaller in HU and DE than in UK or ES. For example, the UK collection has about 1.7 million terms, while the DE collection has 0.8 million terms. Consequently, about 0.9 million more terms can be issued in English queries, which can be used to narrow down the search.

Table 4: DF of terms in the collection. Number of documents containing the term at different ranks, ordered by DF.

Lang.   100      1K       10K     50K
ES      4,909    2,249    147     32
UK      13,408   3,336    186     25
NL      27,288   5,466    350     20
PT      91,350   3,512    135     12
HU      92,975   15,563   1,635   28
DE      94,330   19,780   870     42

Table 5: Mean average DF and mean minimum DF of query terms.

Lang.   Stat.   Man.     Mod.1    Mod.2    Mod.3
ES      Avg.    945      1,011    686      413
        Min.    587      573      573      257
UK      Avg.    2,876    2,269    2,095    1,059
        Min.    1,294    1,240    1,141    171
NL      Avg.    5,021    8,747    9,812    2,657
        Min.    3,246    7,053    6,574    1,115
PT      Avg.    12,388   20,440   21,500   10,541
        Min.    6,807    17,637   17,537   6,358
HU      Avg.    14,139   46,151   35,910   35,052
        Min.    8,570    43,898   28,996   25,347
DE      Avg.    18,707   29,655   50,582   18,057
        Min.    4,243    18,003   33,134   13,621

Table 6: Inlink statistics for the collection, manual known items and simulated known items. Average number of inlinks in the collection, the manual topics, and the simulated topics (without and with the document prior).

Lang.   Coll.   Manual topics   Simulated w/o prior   Simulated w. prior
DE      7.3     142.9           2.1                   2925.6
ES      5.4     21.1            3.9                   168.8
HU      15.6    9763.8          2.1                   40490.2
NL      20.3    388.8           9.2                   4273.2
PT(2)   0.001   0.169           0.0                   0.0
UK(2)   0.002   0.010           0.0                   0.015

(2) These collections do not have many links within the collection (which results in the very low values).

DF of Terms in Queries. Next, we examined the document frequency (DF) of the terms in the queries (see Table 5). The DF of query terms from the generated queries was far higher than that of the manual queries, especially on the HU and DE collections. For example, for HU manual queries the average DF was 14,139, while for the three models of simulated queries the average DF was 46,151, 35,910 and 35,052, respectively. Consequently, the user querying models did not appear to sufficiently favor terms that were as discriminative as those in the manual queries. However, for ES and UK queries the difference in average DF was small, which accounts for their comparable performance against the manual queries on models 1 and 2. Conversely, the DF was somewhat lower for these queries on model 3, which is reflected in the higher performance.

Document Importance. Here, we consider whether the importance of a document has an impact on the generation of a known item. We estimate the importance of a document in terms of its inlinks. Table 6 shows the average number of inlinks for a page in the collection, the manual topics, and the simulated topics (inlink counts for simulated known-item topics are averaged over all topics generated for each collection). The average number of inlinks in the collection is quite low, but the targets of manual known-item queries had substantially more inlinks. Documents of simulated known items created using the uniform document prior had very low inlink counts and did not resemble the manual known items in this respect. Since manual topics are generated by selecting documents that tend to be linked more, it seems sensible that this prior on selection should be incorporated into the query generation process. However, it is not clear why this would provide any improvements in performance, as the retrieval models considered in this paper do not use any link information when scoring documents!

Other Factors. There are any number of other factors that could be influencing the quality and realism of the queries produced for a known item. For instance, the selection of bigrams, the selection of terms from structured fields, variations in selection strategies (i.e., querying is a mixture of different query models depending on the state of the user), the length of a known item, etc. Another factor, specific to DE, HU and NL, is the possible influence of morphological normalization on the effectiveness of the query generation process. For these languages de-compounding could have been employed, before or after the query sampling process. In the former case it could increase the size of the vocabulary, and it could do so in a way that drastically changes the statistics of terms [9]. However, whether this, or the other factors, would affect the replicative validity of the generation models is left for future analysis and consideration.

6. IMPROVING SIMULATED TOPIC PERFORMANCE

The analysis in the previous section suggests that two factors are quite different between simulated and manual topics: (1) the discrimination of query terms, and (2) the number of inlinks of known items. To determine whether these two factors do have an impact on the quality of simulated topics, we perform a series of follow-up experiments where we use a different selection strategy and a non-uniform document prior. By applying these changes to the generation sequence we hope to improve the performance of the topics so that they are more realistic (in the sense that they share more of the characteristics of manual known items). Hopefully, this will result in query models whose replicative validity can be established.

Improving the Term Discrimination. To improve the discrimination of the query terms selected, we propose a combined strategy of popularity and discrimination, where terms that are highly discriminative within the collection, and also very popular within the document, are more likely to be selected. To represent this strategy we make the probability of a term being selected proportional to the tf.idf of the term in the document. The popular+discriminative selection strategy is defined as:

p(t_i | d_k) = [ n(t_i, d_k) · log(N / df(t_i)) ] / [ Σ_{t_j ∈ d_k} n(t_j, d_k) · log(N / df(t_j)) ]    (5)

where N is the number of documents in the collection and df(·) is the document frequency of a term. For each of our six collections, we generated another 100 queries using the popular+discriminative selection strategy (Model 4). All other parameter values are held the same as in the initial experiments (Section 3). Table 7 reports the performance of the topics using Model 4, without inlink priors.

Table 7: Mean Reciprocal Rank: on topics generated using Model 4, without inlink prior.

Lang.   Query Type   TF.IDF    OKAPI     LM
ES      Model 4      0.3425    0.4622    0.4267
UK      Model 4      0.5843    0.7013    0.6726
NL      Model 4      0.2531    0.3576    0.3314
PT      Model 4      0.1478*   0.2017*   0.1713*
HU      Model 4      0.0623    0.0739    0.0661
DE      Model 4      0.0462    0.0535    0.0482

Using the popular+discriminative selection strategy (Model 4) produces results similar to the discriminative selection strategy (Model 3), and only results in producing comparable queries for the PT collection.
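The popular+discriminative strategy of Eq. 5 amounts to sampling query terms with probability proportional to their tf.idf weight within the known item. The following is a minimal sketch, not the authors' actual implementation: it assumes the document is available as a bag of words and that document frequencies have been precomputed, and all function and variable names are illustrative.

```python
import math
import random
from collections import Counter

def model4_term_probs(doc_terms, df, n_docs):
    """Eq. 5: p(t_i | d_k) proportional to n(t_i, d_k) * log(N / df(t_i)),
    normalised over all terms occurring in the document."""
    counts = Counter(doc_terms)
    weights = {t: n * math.log(n_docs / df[t]) for t, n in counts.items()}
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items()}

def sample_query(doc_terms, df, n_docs, length, rng=random):
    """Draw `length` query terms independently from the Model 4 distribution."""
    probs = model4_term_probs(doc_terms, df, n_docs)
    terms, weights = zip(*probs.items())
    return rng.choices(terms, weights=weights, k=length)
```

Note that a term occurring in every document receives zero probability (its idf is zero), which is exactly the discriminative behavior the strategy is after; popular-but-rare terms within the document dominate the sample.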
Unfortunately, then, trying to improve the term selection strategy does not succeed in increasing the performance for the HU, DE and NL collections. As a consequence, we still do not have replicative validity for all languages. While it appeared that the more discriminative term selection strategies (Models 3 and 4) would be more suitable for the latter languages (DE, HU and NL), because terms in these languages had a high DF, this did not necessarily translate into significantly better retrieval performance. We posit that users combat the DF problem by issuing specific combinations of terms, which result in a lower DF when combined. Consequently, independently selecting terms during the process may not sufficiently replicate this phenomenon. This suggests that in order to build query models whose outputs are more comparable to manual queries, the dependencies between query terms need to be captured, i.e., term selection based on the conditional probability given the previous query terms (i.e., p(t_i | t_{i-1}, ..., t_1, d_k)).

Table 8: Mean Reciprocal Rank: on simulated queries with inlink prior.

Lang.   Query Type   TF.IDF    OKAPI     LM
ES      Model 1      0.2533    0.4467    0.3577
        Model 2      0.2786*   0.4099*   0.3430*
        Model 3      0.5620    0.6368    0.6080
        Model 4      0.3523    0.5418    0.4563
UK      Model 1      0.3334*   0.4994*   0.4363*
        Model 2      0.3602*   0.5171*   0.4883*
        Model 3      0.7741    0.7640    0.7719
        Model 4      0.5526    0.6812    0.6279
NL      Model 1      0.1662    0.2898    0.2453
        Model 2      0.1637    0.2394    0.2013
        Model 3      0.6464    0.6189*   0.6204*
        Model 4      0.3151    0.4392*   0.3751*
PT      Model 1      0.1383    0.1729    0.1712
        Model 2      0.1395    0.1370    0.1653
        Model 3      0.1979*   0.2249*   0.2200*
        Model 4      0.1468*   0.1796*   0.1654*
HU      Model 1      0.2707*   0.2753*   0.3166*
        Model 2      0.3317*   0.3228*   0.4149*
        Model 3      0.6908    0.6181    0.6033
        Model 4      0.5431    0.4688*   0.4664
DE      Model 1      0.1445    0.1762    0.1405
        Model 2      0.1878    0.1615    0.1847
        Model 3      0.4012*   0.3767    0.4092*
        Model 4      0.2625    0.3179    0.3093

Encoding the Document Importance.
From our analysis of inlink counts in the previous section, it seems reasonable to set a non-uniform document prior for the selection of the known item. This encodes the querying behavior of users, who, as we have seen, are more likely to retrieve a known item that is important (defined by the number of inlinks). To examine this proposition we set the document prior p(d_k) to be:

p(d_k) = ( in(d_k) + 1 ) / Σ_{d_i} ( in(d_i) + 1 )    (6)

where in(d_k) is the number of inlinks to document d_k and the summation ranges over the entire document collection. We now turn to an empirical examination of the four query models introduced so far, but now with a non-uniform document prior as defined in Eq. 6. For each of our six collections and four models we generated another 100 queries. Table 8 shows the performance of each of the query models for each collection and retrieval model. Clearly, the non-uniform prior has a significant impact on performance. From the number of inlinks of known items selected with the non-uniform prior (shown in the last column of Table 6), it would seem that the prior is too strong and results in documents with a larger average number of inlinks than the manual topics. Regardless, the performance on these queries is surprisingly high, despite the fact that none of the retrieval models use any link information. For all languages we obtain substantial increases in MRR for the majority of the query models. Specifically, for ES, model 2 queries are comparable to manual topics, instead of models 1 and 2 (as in the uniform prior case), while for the UK collection, model 1 and 2 queries are now comparable, as opposed to only model 1 previously. In all other instances, the performance of the models was substantially higher than what manual queries obtain. The application of the inlink prior in the process results in simulated topics for HU on models 1 and 2, and PT on model 3, that are now comparable with the manual topics.
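The document prior of Eq. 6 is add-one-smoothed sampling proportional to inlink counts, so that documents with no inlinks remain selectable. A minimal sketch follows, assuming inlink counts are available as a dictionary from document identifier to count; the function names are illustrative, not from the paper.

```python
import random

def inlink_prior(inlinks):
    """Eq. 6: p(d_k) = (in(d_k) + 1) / sum over d_i of (in(d_i) + 1).
    The +1 smoothing gives zero-inlink documents a small non-zero mass."""
    total = sum(count + 1 for count in inlinks.values())
    return {doc: (count + 1) / total for doc, count in inlinks.items()}

def sample_known_item(inlinks, rng=random):
    """Select the known-item target with probability given by the prior."""
    prior = inlink_prior(inlinks)
    docs, weights = zip(*prior.items())
    return rng.choices(docs, weights=weights, k=1)[0]
```

With this prior, heavily linked pages are drawn far more often than random ones, which mirrors the observation above that the prior may even be too strong relative to the inlink counts of manual topics.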
The generation process has been modified to reflect the desire of seeking important pages, and this has brought the retrieval performance in line with manual queries. For the DE and NL simulated queries some comparable performance is now found with Model 3, but only on two of the retrieval models. These results show that the known-item document selected is important and improves performance (substantially so in the case of DE topics), but the simulated topics still underperform compared to the manual topics.

7. DISCUSSION

We analyzed the performance of simulated known-item queries in six European languages, where the differences between the languages considered raised many issues concerning the automatic generation of queries. Different languages require different models to cater for the discrimination of terms, and models which do not arbitrarily select random known items but known items which tend to be more important. The latter change to models previously proposed in the literature is very surprising because, despite the retrieval models not using any link information, the performance of these queries is substantially better across all collections. This reveals an interesting pattern of behavior on behalf of the user: they prefer to retrieve known items which tend to be more important. This difference in performance between generation models with and without the prior on importance suggests that important pages have certain characteristics which make them more likely to be ranked highly, regardless of retrieval model. Retrieving a random document in the collection is substantially more difficult than retrieving more important documents. While the querying models currently employed do not always adequately capture the querying process that a user adopts when formulating a known-item query, we have identified several areas where the process can be improved.
For particular languages certain strategies have proved to be more appropriate. It is possible to generate queries for known-item topics that have comparable performance to the manual queries for UK and ES with Popular/Uniform strategies, while the other languages are better served by more discriminative term selection models coupled with the importance prior. However, further research is required to develop more sophisticated models of known-item topic generation which extend the process further to consider other factors, such as term dependence, document length, de-compounding, etc.

8. CONCLUSIONS

We have analyzed a number of factors which affect the quality of the queries produced by previous query generation models. This has led to refinements of the query generation process, employing an improved term selection method and document priors based on inlink counts. Known-item topics can now be simulated in such a way that they produce similar performance to manually created known-item topics. Put differently, the generation models are replicatively validated: the performance of the output of both processes is sampled from the same underlying distribution. Thus, we have identified specific generation models for specific languages that produce simulated queries that are replicatively valid. While this is an important step and contribution toward using simulated known-item topics, further work is required in order to build models of query generation that are not only replicatively, but also predictively valid. E.g., we have not examined whether the simulated queries themselves are similar to real queries for a given known item. I.e., are the query terms generated the same as the real terms? This is left to further work, along with examining a number of other arising research questions, such as: why are particular documents favored by the user, but also easier to retrieve, and why are random documents harder to find? And do system rankings resulting from replicatively valid query models correlate well with system rankings resulting from manual topics?

9. ACKNOWLEDGMENTS

Krisztian Balog was supported by the Netherlands Organisation for Scientific Research (NWO) by a grant under project number 220-80-001. Maarten de Rijke was also supported by NWO under project numbers 017.001.190, 220-80-001, 264-70-050, 354-20-005, 600.065.120, 612-13-001, 612.000.106, 612.066.302, 612.069.006, 640.001.501, 640.002.501, and by the E.U. IST programme of the 6th FP for RTD under project MultiMATCH contract IST-033104.

10. REFERENCES

[1] L. Azzopardi and M. de Rijke. Automatic construction of known-item finding test beds. In Proc. 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 603-604, 2006.
[2] K. Balog, L. Azzopardi, J. Kamps, and M. de Rijke. Overview of WebCLEF 2006. In A. Nardi, C. Peters, and J. Vicedo, editors, Working Notes CLEF 2006, Sept 2006.
[3] H. Bast and I. Weber. Type less, find more: fast autocompletion search with a succinct index. In Proc. 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 364-371, 2006.
[4] A. Berger and J. Lafferty. Information retrieval as statistical translation. In Proc. 22nd ACM SIGIR Conference on Research and Development in Information Retrieval, pages 222-229, Berkeley, CA, 1999.
[5] D. Cai, C. J. van Rijsbergen, and J. Jose. Automatic query expansion based on divergence. In Proc. 10th International Conference on Information and Knowledge Management, pages 419-426, 2001.
[6] P. Calado, A. S. da Silva, R. C. Vieira, A. H. F. Laender, and B. A. Ribeiro-Neto. Searching web databases by structuring keyword-based queries. In Proc. 11th International Conference on Information and Knowledge Management, pages 26-33, New York, NY, USA, 2002.
[7] J. P. Callan and M. E. Connell. Query-based sampling of text databases. Information Systems, 19(2):97-130, 2001.
[8] S. Cronen-Townsend, Y. Zhou, and W. B. Croft. Predicting query performance. In Proc. 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 299-306, 2002.
[9] V. Hollink, J. Kamps, C. Monz, and M. de Rijke. Monolingual document retrieval for European languages. Information Retrieval, 7:33-52, 2004.
[10] M. Inoue. The remarkable search topic-finding task to share success stories of cross-language information retrieval. In Proc. SIGIR 2006 Workshop on New Directions in Multilingual Information Access, 2006.
[11] M. Inoue and N. Ueda. Retrieving lightly annotated images using image similarities. In Proc. 2005 ACM Symposium on Applied Computing, pages 1031-1037, 2005.
[12] C. Jordan, C. Watters, and Q. Gao. Using controlled query generation to evaluate blind relevance feedback algorithms. In Proc. 6th ACM/IEEE-CS Joint Conference on Digital Libraries, pages 286-295, 2006.
[13] B. Sigurbjörnsson, J. Kamps, and M. de Rijke. EuroGOV: Engineering a multilingual Web corpus. In C. Peters, F. C. Gey, J. Gonzalo, G. Jones, M. Kluck, B. Magnini, H. Müller, and M. de Rijke, editors, Accessing Multilingual Information Repositories, LNCS 4022. Springer Verlag, 2006.
[14] B. Sigurbjörnsson, J. Kamps, and M. de Rijke. Overview of WebCLEF 2005. In C. Peters, F. Gey, J. Gonzalo, H. Müller, G. Jones, M. Kluck, B. Magnini, and M. de Rijke, editors, Accessing Multilingual Information Repositories, LNCS 4022, pages 810-824. Springer, Sept 2006.
[15] J. Tague, M. Nelson, and H. Wu. Problems in the simulation of bibliographic retrieval systems. In Proc. 3rd Annual ACM Conference on Research and Development in Information Retrieval, pages 236-255, 1981.
[16] J. M. Tague and M. J. Nelson. Simulation of user judgments in bibliographic retrieval systems. In Proc. 4th Annual International ACM SIGIR Conference on Information Storage and Retrieval, pages 66-71, 1981.
[17] E. M. Voorhees. The philosophy of information retrieval evaluation. In Revised Papers from the Second Workshop of the Cross-Language Evaluation Forum on Evaluation of Cross-Language Information Retrieval Systems, pages 355-370, London, UK, 2002. Springer-Verlag.
[18] B. P. Zeigler. Theory of Modelling and Simulation. Wiley, 1976.