How Well does Result Relevance Predict Session Satisfaction?

Scott B. Huffman, Michael Hochster
Google, Inc.
{huffman,hochster}@google.com

ABSTRACT
Per-query relevance measures provide standardized, repeatable measurements of search result quality, but they ignore much of what users actually experience in a full search session. This paper examines how well we can approximate a user's ultimate session-level satisfaction using a simple relevance metric. We find that this relationship is surprisingly strong. By incorporating additional properties of the query itself, we construct a model which predicts user satisfaction even more accurately than relevance alone.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

General Terms
Measurement, Experimentation

Keywords
search evaluation, user satisfaction, relevance metrics, precision

1. INTRODUCTION
Traditional work in search engine evaluation can be grouped into two broad categories: relevance measurements and user satisfaction studies. Relevance measurements [7, 19, 14] (such as nDCG [9], MAP [5], bpref [6], Precision at K, etc.) measure a search engine's performance on a set of queries by grading the relevance of each result returned. This is convenient for comparing engines over time, because once results are graded, aggregate relevance can be re-calculated for different result lists or orderings. However, relevance measurements ignore much of what users do with search engines: long sessions with multiple searches, scanning of result titles and snippets, opening results, and using other search engine features. Usability and user satisfaction studies [16, 2, 15, 11, 8] measure the user experience more directly, providing valuable insights, but are much less scalable and repeatable.

This raises the question: what is lost by using per-query relevance metrics instead of a more comprehensive view of the session? Or put another way, how well do relevance metrics predict user satisfaction? That is the question we address in this paper. More specifically, assuming a user with some information need starts a search session by typing query Q:

1. How well can we predict the user's satisfaction with their search session, based on the relevance of results returned for Q?

2. Does the relationship between first-query result relevance and session-level satisfaction vary depending on the type of query or information need?

For question 1, we find a surprisingly strong relationship between the relevance of the first query of the session and the user's ultimate satisfaction with the session. For question 2, we find that by incorporating additional properties of the first query, we are able to construct a model which predicts user satisfaction more accurately than relevance alone. In addition, we find that the remaining prediction error can be further reduced by incorporating the average number of events in the rated sessions.

2. METHODOLOGY
In order to collect data representative of real searches, we drew a random sample of 200 US English-language queries submitted to the Google search engine in mid-2006. Explicitly pornographic queries were excluded from the sample. To measure result relevance, we re-submitted these 200 queries to Google, and asked raters to assess the relevance of each of the first three search results returned.
Seven raters rated each query-result pair on a graduated scale; their ratings were normalized and averaged to a score between 0 and 1. We also calculated a summary relevance score for each query, by taking a position-weighted average of the result relevance scores for the query.

Separately, we asked raters to categorize each query along several dimensions: whether it contained a misspelling, whether they felt it was navigational, transactional, or informational [4, 12], whether it was topically specific or broad, whether it referred to a specific entity, and so on. Nine raters categorized each query, and category labels were assigned by majority vote. For example, a query would be labeled "navigational" if more than half the raters marked it as such.

To measure user satisfaction with search sessions starting from each query, we first asked five raters to give their interpretation of a typical user's likely "information need" corresponding to each query. To address the possible ambiguity of some queries, we allowed raters to provide primary and secondary statements; secondary statements were supplied for 87 of the original 200 queries. These statements were then centrally reviewed, the most frequent primary and secondary statements were chosen for each query, and the form of the language was normalized. A few examples of queries and their corresponding information need statements and query categories are shown in Table 1.

Finally, each query and corresponding information need were given to a separate group of human raters. Similar to "specified task" studies, these raters were asked to pretend to be a person with the information need [13]. They started a session by issuing the given query to Google, and then continued as they felt a person with this information need might, stopping when their need was met or when they wanted to give up. After the session was complete, they were asked to rate their satisfaction with their experience and provide comments. Their web browser actions during the session were recorded.

Seven raters were given each query + (primary or secondary) information need. For queries with only a primary information need, the seven satisfaction ratings were averaged and normalized to a score between 0 and 1. For queries with both primary and secondary information needs, we calculated a blended satisfaction score by taking 0.7 times the average of primary satisfaction ratings plus 0.3 times the average of secondary satisfaction ratings. Similarly, we computed the average number of pages sequentially visited in the rater sessions; for queries with both primary and secondary information needs, we blended using a weighted average.

To summarize the setup: we start with random queries that were issued to Google. This has the great advantage of making our results generalizable to Google's querystream. The disadvantage is that we don't know the original intent behind the queries. To infer these intents, we call on a group of raters. Separately, we ask additional raters to provide query categorizations, relevance and satisfaction measures. We separated the task of inferring intent from the session satisfaction task in order to keep each task tractable for the raters, and thus produce cleaner data.
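The blended scoring just described is simple enough to sketch in code. The following Python snippet is only an illustration, not the authors' tooling: the raw rating scale (assumed 0-4 here) and the helper names are hypothetical, while the 0.7/0.3 blend comes from the text above.

```python
# Illustrative sketch of the satisfaction-score aggregation described above.
# The 0-4 rating scale and function names are assumptions for illustration.

def mean_normalized(ratings, scale_max=4.0):
    """Average raw rater scores and normalize to [0, 1]."""
    return sum(r / scale_max for r in ratings) / len(ratings)

def blended_satisfaction(primary_ratings, secondary_ratings=None):
    """0.7 * mean primary + 0.3 * mean secondary, as in the paper;
    queries with no secondary need fall back to the primary mean."""
    primary = mean_normalized(primary_ratings)
    if not secondary_ratings:
        return primary
    return 0.7 * primary + 0.3 * mean_normalized(secondary_ratings)

# Example: seven raters per information need.
print(blended_satisfaction([4, 3, 4, 4, 2, 3, 4], [2, 3, 1, 2, 2, 3, 2]))
```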
3. ANALYSIS
In the following sections, we compare the relevance of the first query of the session (which we will refer to simply as "relevance") to the user's final satisfaction with the search session. To help make this comparison, it is convenient to compute an aggregate measure of relevance. For each query, we have relevance judgments between 0 and 1 for the top three search results. We define the relevance of a query, Relevance[q], as a simple position-weighted mean (position meaning whether the result was 1st, 2nd, or 3rd for the query), weighted by 1/position. Calling the relevance judgments for each position pos1, pos2, and pos3:

    Relevance[q] = (pos1 + pos2/2 + pos3/3) / (1 + 1/2 + 1/3)

This is a simple discounted cumulative gain measure [9].

3.1 Satisfaction versus relevance
How strong is the correlation of first-query relevance to session satisfaction? Figure 1 shows a scatter-plot of session satisfaction versus first-query relevance. The plot is scaled so that 0 and 1 correspond to the minimum and maximum observed values for each dimension.

[Figure 1: Session satisfaction vs. first-query relevance. Correctly spelled and misspelled queries are marked separately.]

As the plot shows, there is a reasonably strong but somewhat noisy linear relationship between satisfaction and relevance; the Pearson correlation is 0.727. The scatter-plot shows a stronger correlation at the upper end of the relevance scale. In fact, there are no points in the lower right quadrant, indicating that, at the high end of the scale, first-query relevance is a very strong predictor of session satisfaction. At lower relevance levels, the correlation is weaker.

Examining the largest outliers, we observed that several are queries for which, in addition to search results, Google returns a spelling suggestion ("Did you mean..."). We've marked these queries (26 of the original 200) with an X in the plot. For these queries, session satisfaction is much higher than the corresponding first-query relevance would predict. This makes sense: users saw the spelling suggestion, clicked on it, and continued their session with the (presumably much more relevant) results of the spell-corrected query; thus the relevance of the initial search results is uncorrelated with the user's satisfaction.

Spelling suggestions are only one example of "extra" elements that modern web search engines return to users; other examples are current news, images, "local" results, stock charts, etc. A user looking for the current price of a stock who types the ticker symbol into Google or Yahoo will often be satisfied by a chart at the top of the page, without looking at any web search results. As researchers strive to evaluate how effectively search engines satisfy users' needs, these "extra" elements will need to be accounted for. For misspellings, the relevance of the spell-corrected query's results to the user's information need could be used to help predict satisfaction; we do not have that data in the present study. Removing these 26 queries from our dataset, the correlation of session satisfaction and first-query relevance increases to 0.784 for the remaining 169 queries.
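As a concrete illustration of the Relevance[q] definition above and the correlation reported in this section, here is a minimal Python sketch. The per-query numbers are invented; only the 1/position weighting and the use of Pearson correlation follow the text.

```python
# Minimal sketch: position-weighted relevance and its correlation with
# session satisfaction. Example data are invented for illustration.
from statistics import correlation  # Pearson correlation, Python 3.10+

def relevance(pos1, pos2, pos3):
    """1/position-weighted mean of the top-three result relevance scores."""
    return (pos1 + pos2 / 2 + pos3 / 3) / (1 + 1 / 2 + 1 / 3)

# Hypothetical per-query data: (pos1, pos2, pos3) relevance and session satisfaction.
queries = [((0.9, 0.7, 0.6), 0.85), ((0.4, 0.5, 0.3), 0.40), ((0.8, 0.2, 0.1), 0.70)]
rel = [relevance(*r) for r, _ in queries]
sat = [s for _, s in queries]
print(correlation(rel, sat))  # analogous to the 0.727 reported in the paper
```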
Table 1: Examples of query information need statements

  Query: "call for help" Leo
    Primary info. need: User expects to be taken to the website for the TV show "Call For Help".
    Secondary info. need: User is looking for information about Leo LaPorte, the host of the TV show "Call For Help".
    Query attributes: informational, specific

  Query: leonard cohen carol
    Primary info. need: User is looking for Christmas carols sung by Leonard Cohen.
    Query attributes: informational

  Query: fisher price
    Primary info. need: User expects to be taken to the website for the toy company 'Fisher Price'.
    Query attributes: navigational

  Query: red envelope
    Primary info. need: User expects to be taken to the homepage of Red Envelope, the well-known gift site.
    Secondary info. need: User is looking for the meaning of the red envelope or red packet given in Chinese society.
    Query attributes: navigational, specific

  Query: session saver firefox 1.5
    Primary info. need: User is looking to download the Session Saver add-on for Firefox 1.5.
    Secondary info. need: User is looking for information about the Session Saver add-on for Firefox 1.5.
    Query attributes: transactional, specific

  Query: sleeppy hollow
    Primary info. need: User is looking for information related to the movie Sleepy Hollow.
    Secondary info. need: User is looking for the town of Sleepy Hollow, NY.
    Query attributes: informational, misspelled

3.2 Modeling satisfaction
Having removed misspellings, we now attempt to construct a more accurate model of satisfaction, based on first-query positional result relevance and other properties. We build an increasingly complex sequence of linear models and measure their performance in two different ways. The first, "overall correlation," is the correlation of the observed user satisfaction scores with the model predictions; it is the square root of the model's standard R-squared statistic. One potential drawback of this goodness-of-fit measure is that it must increase as the number of variables in the model increases. To address this, we also present a cross-validated correlation, calculated by leaving out one data point at a time and predicting it from the others. The cross-validated correlation is the correlation between the observed satisfaction scores and these leave-one-out predictions. For our sequence of models, the cross-validated correlation and overall correlation are about the same, giving us some assurance that the models are not over-fitting.

Figure 2 shows a scatter-plot of satisfaction versus the relevance score for the first-position result of the first query in the session (misspelled queries excluded).

[Figure 2: Session satisfaction vs. relevance of first-position result]

Our simplest model is based only on this first-position relevance. Interestingly, the resulting model (Pos1) still produces a reasonable correlation to session satisfaction. The correlation is 0.722, lower than the correlation with the 1/position-weighted relevance mean of the top three positions (0.784), but strong enough to underscore the importance of the first position to users. The cross-validated correlation is slightly lower, at 0.714.

Of course, we can do better by using the relevance scores of all three top positions for the first query. This model (call it Pos1 + Pos2 + Pos3) produces a correlation of 0.786, almost identical to the correlation produced by our 1/position-weighted mean. At least for this dataset, weighting by 1/position is about as accurate as deriving weights from the satisfaction data itself. The cross-validated correlation is again slightly lower, at 0.774.
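The leave-one-out "cross-validated correlation" used to check the models above can be sketched as follows. This is an illustrative Python implementation under assumed data shapes, not the authors' analysis code, and the simulated data are invented.

```python
# Sketch of leave-one-out cross-validated correlation: each satisfaction value
# is predicted by a linear model fit on the remaining queries, and the
# correlation between observations and those held-out predictions is reported.
import numpy as np

def loo_cv_correlation(X, y):
    """X: (n_queries, n_features) relevance features; y: satisfaction scores."""
    n = len(y)
    X1 = np.column_stack([np.ones(n), X])          # add intercept column
    preds = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        beta, *_ = np.linalg.lstsq(X1[mask], y[mask], rcond=None)
        preds[i] = X1[i] @ beta                    # predict the held-out query
    return np.corrcoef(y, preds)[0, 1]

# Example with a Pos1 + Pos2 + Pos3 feature set on invented data.
rng = np.random.default_rng(0)
X = rng.uniform(size=(50, 3))                      # top-3 relevance scores
y = 0.4 + X @ [0.15, 0.09, 0.05] + rng.normal(0, 0.05, 50)
print(loo_cv_correlation(X, y))
```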
3.3 Effect of query properties
In this section we investigate whether other query properties help explain satisfaction above and beyond the relevance of the results. We look first at whether the query is identified as primarily navigational, informational, or transactional [4, 12], based on the dominant opinion among nine human judges per query.

Figure 3 shows scatter-plots of satisfaction versus relevance at each of the top three positions for both navigational and non-navigational queries. Least-squares lines are superimposed.

[Figure 3: Session satisfaction vs. relevance, for navigational and non-navigational queries, with one panel per result position (1-3) for each query type]

The plots show that for navigational queries, the relationship between relevance and satisfaction weakens rapidly after the first position. For non-navigational queries, on the other hand, relevance and satisfaction are strongly related at all three positions. The two "outliers" on the navigational side are rare queries which are navigational in nature but for which few relevant results are returned; these queries therefore score low both on relevance and on session satisfaction.

To model how the relationship between relevance and satisfaction may vary by type of query, we built a linear model, (Pos1 + Pos2 + Pos3)*Nav, which estimates two coefficients for the relevance at each position: one for navigational queries, and one for non-navigational queries. This model produces an improved correlation of 0.807, and a cross-validated correlation of 0.791. The model statistics are summarized in Table 2. The coefficients with no suffix correspond to non-navigational queries, while the coefficients with a "nav" suffix correspond to navigational queries.

Table 2: Summary of the (Pos1 + Pos2 + Pos3)*Nav model

  Coefficient    Estimate    p-value
  (Intercept)     0.3983     0.0000
  pos1            0.1012     0.0000
  pos2            0.0883     0.0003
  pos3            0.1103     0.0000
  pos1nav         0.0938     0.0015
  pos2nav        -0.0653     0.1636
  pos3nav        -0.135      0.0007

Notice that for navigational queries, the implicit weighting on the first result relative to the other two is much higher than for non-navigational queries. For non-navigational queries, the relative weighting is much flatter across the three results. This comports with our intuition that for navigational queries, what matters most is that the target result appears in the first position, while for other kinds of queries the information accumulated in later results is more important.

We investigated several other query categories (informational, transactional, specific, broad, refers-to-entity) and found that they add little information beyond relevance in explaining user satisfaction.
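For readers who want to reproduce the shape of the (Pos1 + Pos2 + Pos3)*Nav analysis described above, the sketch below fits the same kind of interaction model on simulated data using statsmodels' R-style formula interface. The column names, simulated coefficients, and noise level are assumptions for illustration; only the model formula mirrors the text.

```python
# Sketch of an interaction model with separate per-position relevance
# coefficients for navigational and non-navigational queries.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 150
df = pd.DataFrame({
    "pos1": rng.uniform(size=n),
    "pos2": rng.uniform(size=n),
    "pos3": rng.uniform(size=n),
    "nav": rng.integers(0, 2, size=n),   # 1 = navigational, 0 = not
})
# Simulated satisfaction: navigational queries lean more heavily on position 1.
df["sat"] = (0.4 + 0.10 * df.pos1 + 0.09 * df.pos2 + 0.11 * df.pos3
             + df.nav * (0.09 * df.pos1 - 0.06 * df.pos2 - 0.13 * df.pos3)
             + rng.normal(0, 0.05, n))

model = smf.ols("sat ~ (pos1 + pos2 + pos3) * nav", data=df).fit()
print(model.summary().tables[1])   # per-position and interaction coefficients
```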
Figure 4 shows the scatter plot of satisfaction versus relevance, labeled with X's for informational queries; Figure 5 shows the same for transactional queries.

[Figure 4: Session satisfaction vs. relevance, with informational queries indicated]

[Figure 5: Session satisfaction vs. relevance, with transactional queries indicated]

The X's are more or less randomly mixed in with the dots in both cases, demonstrating that the informational/non-informational and transactional/non-transactional status of a query is not helpful beyond relevance in explaining satisfaction. The regression output for the model (Pos1 + Pos2 + Pos3)*Info, summarized in Table 3, corroborates this: the coefficients show only a slight (and not statistically significant) flattening of the implicit position weights for informational queries.

Table 3: Summary of the (Pos1 + Pos2 + Pos3)*Info model

  Coefficient    Estimate    p-value
  (Intercept)     0.4076     0.0000
  pos1            0.1468     0.0000
  pos2            0.0872     0.0035
  pos3            0.0353     0.1675
  pos1info       -0.0391     0.1715
  pos2info       -0.0097     0.8104
  pos3info        0.0655     0.0776

To the extent there is any effect, it is because informational queries are non-navigational. When navigational queries are removed from the data, even this small effect disappears, as summarized in Table 4.

Table 4: Summary of the (Pos1 + Pos2 + Pos3)*Info model on non-navigational queries

  Coefficient    Estimate    p-value
  (Intercept)     0.3961     0.0000
  pos1            0.1165     0.0022
  pos2            0.1077     0.0276
  pos3            0.0885     0.0820
  pos1info       -0.0217     0.6238
  pos2info       -0.0254     0.6530
  pos3info        0.0308     0.5960

Similar analysis shows that the other query categories (transactional, specific, broad, refers-to-entity) mentioned above are also uninformative for this dataset. We also tried adding simple query length (number of words) to the model, on the theory that perhaps the satisfaction profile is different for long versus short queries, a mechanical form of the specific versus broad distinction. This too was uninformative.

3.4 Explaining the remaining variance
Our best model so far, (Pos1 + Pos2 + Pos3)*Nav, has a correlation of 0.8 with observed satisfaction. What explains the remaining variance? Obviously, more happened in the users' sessions than just the first query. The user sometimes did multiple searches, and often looked at multiple results before declaring herself "done" with the task and giving a satisfaction rating. A full model of satisfaction based on individual events could try to combine measurements of each event or page in a user session in some form of sequence model. We do not present such a model here, but we do have one rough proxy for what happened in these sessions after the initial search: we recorded the number of "events" (different pages viewed, whether search result pages or other URLs) during each session. Adding the number of events as a main effect yields the model (Pos1 + Pos2 + Pos3)*Nav + NumEvents, which increases the correlation to session satisfaction to 0.841 and the cross-validated correlation to 0.825. Figure 6 illustrates how the models provide increasingly accurate predictions of satisfaction as more variables are added.

Figure 6: Summary of model performance

  Model                                   CV correlation    Overall correlation
  Pos1                                         0.71                0.72
  Pos1 + Pos2 + Pos3                           0.77                0.79
  (Pos1 + Pos2 + Pos3)*Nav                     0.79                0.81
  (Pos1 + Pos2 + Pos3)*Nav + NumEvents         0.82                0.84
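To make the fitted position weights concrete, the short sketch below plugs hypothetical first-page relevance scores into the published Table 2 coefficients. The input relevance values are invented; this is a toy calculation, not part of the paper's evaluation pipeline.

```python
# Toy prediction using the Table 2 coefficients of the (Pos1+Pos2+Pos3)*Nav
# model. Input relevance values are invented for illustration.
COEF = {"intercept": 0.3983, "pos1": 0.1012, "pos2": 0.0883, "pos3": 0.1103,
        "pos1nav": 0.0938, "pos2nav": -0.0653, "pos3nav": -0.135}

def predict_satisfaction(pos1, pos2, pos3, navigational):
    """Linear prediction; the nav terms apply only to navigational queries."""
    s = (COEF["intercept"] + COEF["pos1"] * pos1
         + COEF["pos2"] * pos2 + COEF["pos3"] * pos3)
    if navigational:
        s += COEF["pos1nav"] * pos1 + COEF["pos2nav"] * pos2 + COEF["pos3nav"] * pos3
    return s

# A highly relevant first result matters far more for navigational queries.
print(predict_satisfaction(1.0, 0.2, 0.2, navigational=True))
print(predict_satisfaction(1.0, 0.2, 0.2, navigational=False))
```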
4. DISCUSSION
Relevance metrics are useful in part because they enable repeated measurement. Once results for an appropriate sample of queries are graded, relevance metrics can be easily computed for multiple ranking algorithms. In a changing document collection (like the web), new results that appear over time for queries in the sample need only be graded incrementally in order to maintain the relevance measurement. In contrast, approaches to measuring the broader search experience from the user's perspective, such as usability studies, diary studies, and assigned-task studies, are inherently not repeatable. Once a particular subject does a particular task, she generally cannot comparably repeat it, because she has learned from doing it the first time. In addition, the data derived from these studies are not incremental in the way that result relevance judgments are; repeating a measurement when the underlying document collection or search algorithms have incrementally changed requires rerunning the full set of tasks with new subjects.

This work may be viewed as a step towards a repeatable per-query metric that is more strongly correlated with user satisfaction than pure relevance metrics. By adding query-level properties such as whether the query is navigational or misspelled, we were able to predict eventual user satisfaction more accurately than with result relevance judgments alone. We improved the model further by adding the average number of events in the rated sessions starting from the query. While in a repeatable-metric context rated sessions presumably would not be available, search engine logs might be able to provide a proxy, such as the average number of query refinements and result clicks associated with a given query over time. (Interestingly, in our models the number of events was found to be a better predictive variable than the raw session time, which is sometimes thought to be a strong predictor of user satisfaction.)

The other query properties we considered (transactional vs. informational, specific vs. broad, and query length) were not significant predictors of satisfaction. However, our dataset was small. While it is safe to conclude that these properties and others play a small role in predicting satisfaction compared to a query's status as misspelled or navigational, analysis of a larger dataset might uncover significant effects for these or other query characteristics. Predicting satisfaction might also be aided by taking more of the elements of the session into account: downstream searches and the relevance of their results, the utility of extra information or elements returned (stock quotes, weather, images, and so on), the number of result pages viewed after each search, and so on. That said, we were surprised by how well user satisfaction can be predicted just from first-query relevance and some query properties. Apparently, in web search, first impressions matter a lot: even modeling on only the first result of the first query gives a reasonable correlation. Adding in the second and third results, the models suggest a plausible weighting among the results, steeply tilted towards the first result for navigational queries and flatter for non-navigational queries.

We now offer a few caveats about the methodology employed here. To begin with, we asked people who were not the original issuers of each query to produce likely primary and secondary statements of user intent/information need for the query. There is some evidence that people can infer the goals of queries without additional information.
Rose and Levinson [12] had people manually classify queries by type of user goal, with and without additional information about the click behavior of the session the query was drawn from, and found no substantial differences. They concluded that "Although this requires further study, it suggests the surprising result that goals can be inferred with almost no information about the user's behavior" ([12], page 17). In our case, we reviewed the user intent statements and felt they were reasonable, but we cannot be sure of the true range of user intents. We also arbitrarily used only up to two intent statements per query, and for our analysis we combined the mean satisfaction scores for primary and secondary intents in an ad-hoc way (70%/30%). Reliable attribution of user intents to randomly selected queries is a difficult problem which merits further study.

We chose to base our analysis on relevance judgments for the first query of the users' sessions. The reason for this is that our interest lies in measuring search engine performance; the first query of a session is very important from this perspective. However, for sessions containing multiple searches, it may be more predictive of user satisfaction to consider an intermediate search or the final search performed. Kahneman's peak-end rule [10] suggests that the strongest predictors of satisfaction are the peak (positive or negative) and final elements of an experience, regardless of duration. The strong correlation between relevance and satisfaction observed in this study might have been even stronger had we used the final query instead of the first.

Some previous studies have attempted to measure the impact of per-query relevance on user performance (as opposed to user satisfaction) for various kinds of constructed tasks. For example, Turpin and Scholer [17] artificially varied the mean average precision of search results returned to a group of users, and asked the users to perform a precision-based task (find a single document) and a recall-based task (find as many relevant documents as possible within five minutes). They found only a weak relationship between relevance and performance on these types of tasks. Similarly, on a set of specific question-answering tasks, Turpin and Hersh [18] found a lack of correlation between mean average precision and performance. Allan et al. [1], in contrast, found that large improvements in retrieval accuracy (bpref) did improve subjects' speed and effectiveness in constructing answers to multi-faceted questions.

Other researchers have suggested that metrics better reflecting actual user behavior can be produced by measuring IR systems in the context of users performing tasks. For example, Reid [11] proposed evaluating IR systems based on retrieved documents' utility to a task rather than relevance to a query, and Borlund [3] proposed a framework for evaluating interactive IR systems by placing evaluators into task scenarios. Our focus in this work is on evaluating search engines as they are used in practice. This was our motivation for starting with a random sample of actual user queries. These queries are widely varied, and many of them are not adequately described as simple precision or recall tasks. In our view, the user is the ultimate arbiter of a search engine's quality.
If a relevance metric correlates well with user satisfaction, that is a strong vote in its favor, even if it correlates less well with narrower measures such as user performance on constructed tasks.

5. CONCLUSIONS
Our analysis is a step towards bridging the gap between relevance metrics and user satisfaction. We have demonstrated that this gap is not nearly as large as one might think, especially given that relevance metrics ignore all aspects of the UI. Moreover, we found at least one important dimension (navigational/non-navigational) which modulates the relationship between satisfaction and relevance. This suggests that it might be reasonable to use different relevance metrics or discounting functions for different types of queries. In our view, an important attribute of any relevance metric is the degree to which it represents user satisfaction. As we improve our understanding of the relationship between relevance and satisfaction, we will be better able to discover how well search engines meet the needs of their users.

6. ACKNOWLEDGMENTS
Thanks to Dan Russell for helpful comments on earlier drafts of this paper.

7. REFERENCES
[1] J. Allan, B. Carterette, and J. Lewis. When will information retrieval be "good enough"? In SIGIR '05: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pages 433-440, New York, NY, USA, 2005. ACM Press.
[2] D. Bilal. Children's use of the Yahooligans! Web search engine: Cognitive, physical, and affective behaviors on fact-based search tasks. J. Am. Soc. Inf. Sci., 51(7):646-665, 2000.
[3] P. Borlund. The IIR evaluation model: a framework for evaluation of interactive information retrieval systems. Information Research, 8(3), April 2003.
[4] A. Broder. A taxonomy of web search. SIGIR Forum, 36(2):3-10, 2002.
[5] C. Buckley and E. M. Voorhees. Evaluating evaluation measure stability. In SIGIR '00: Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval, pages 33-40, New York, NY, USA, 2000. ACM Press.
[6] C. Buckley and E. M. Voorhees. Retrieval evaluation with incomplete information. In SIGIR '04: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, pages 25-32, New York, NY, USA, 2004. ACM Press.
[7] C. W. Cleverdon. The significance of the Cranfield tests on index languages. In SIGIR '91: Proceedings of the 14th annual international ACM SIGIR conference on Research and development in information retrieval, pages 3-12, New York, NY, USA, 1991. ACM Press.
[8] B. J. Jansen and U. Pooch. A review of web searching studies and a framework for future research. J. Am. Soc. Inf. Sci. Technol., 52(3):235-246, 2001.
[9] K. Järvelin and J. Kekäläinen. Cumulated gain-based evaluation of IR techniques. ACM Trans. Inf. Syst., 20(4):422-446, 2002.
[10] D. Kahneman, P. P. Wakker, and R. Sarin. Back to Bentham? Explorations of experienced utility. The Quarterly Journal of Economics, 112(2):375-405, May 1997.
[11] J. Reid. A task-oriented non-interactive evaluation methodology for information retrieval systems. Information Retrieval, 2(1):115-129, 2000.
[12] D. E. Rose and D. Levinson. Understanding user goals in web search. In WWW '04: Proceedings of the 13th international conference on World Wide Web, pages 13-19, New York, NY, USA, 2004. ACM Press.
[13] D. M. Russell and C. Grimes.
Assigned and self-chosen tasks are not the same in web search. In HICSS '07: Proceedings of the 40th Annual Hawaii International Conference on System Sciences, 2007.
[14] M. Sanderson and J. Zobel. Information retrieval system evaluation: effort, sensitivity, and reliability. In SIGIR '05: Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pages 162-169, New York, NY, USA, 2005. ACM Press.
[15] A. Spink. A user-centered approach to evaluating human interaction with web search engines: an exploratory study. Inf. Process. Manage., 38(3):401-426, 2002.
[16] J. Teevan, C. Alvarado, M. S. Ackerman, and D. R. Karger. The perfect search engine is not enough: A study of orienteering behavior in directed search. In CHI '04: Proceedings of the SIGCHI conference on Human factors in computing systems, pages 415-422, New York, NY, USA, 2004. ACM Press.
[17] A. Turpin and F. Scholer. User performance versus precision measures for simple search tasks. In SIGIR '06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pages 11-18, New York, NY, USA, 2006. ACM Press.
[18] A. H. Turpin and W. Hersh. Why batch and user evaluations do not give the same results. In SIGIR '01: Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, pages 225-231, New York, NY, USA, 2001. ACM Press.
[19] E. M. Voorhees. Evaluation by highly relevant documents. In SIGIR '01: Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, pages 74-82, New York, NY, USA, 2001. ACM Press.