WWW 2007 / Track: Search, Session: Search Potpourri

Navigation-Aided Retrieval

Shashank Pandit (Carnegie Mellon University, shashank@cs.cmu.edu)
Christopher Olston (Yahoo! Research, olston@yahoo-inc.com)

ABSTRACT

Users searching for information in hypermedia environments often perform querying followed by manual navigation. Yet, the conventional text/hypertext retrieval paradigm does not explicitly take post-query navigation into account. This paper proposes a new retrieval paradigm, called navigation-aided retrieval (NAR), which treats both querying and navigation as first-class activities. In the NAR paradigm, querying is seen as a means to identify starting points for navigation, and navigation is guided based on information supplied in the query. NAR is a generalization of the conventional probabilistic information retrieval paradigm, which implicitly assumes no navigation takes place. This paper presents a formal model for navigation-aided retrieval, and reports empirical results that point to the real-world applicability of the model. The experiments were performed over a large Web corpus provided by TREC, using human judgments on a new rating scale developed for navigation-aided retrieval. In the case of ambiguous queries, the new retrieval model identifies good starting points for post-query navigation. For less ambiguous queries that need not be paired with navigation, the output closely matches that of a conventional retrieval system.

hyperlinks) [30]. There are at least three reasons for using navigation as a search tactic:

1. Difficulty in formulating appropriate queries: Users often do not convey their search task to a retrieval system accurately enough to allow them to forgo navigation. A variety of factors may be to blame, including limited query language expressiveness, user inexperience, lack of familiarity with terminology, and the cognitive ease of entering short queries.

2.
Open-ended search tasks: In many cases the scope of the task is broad, and the user has not (yet) formed a concrete notion of what information would lead to its successful completion. (Indeed, there may not even be a meaningful notion of completion.) For example, consider a guitar enthusiast who recently moved to a new city and wishes to find out about the city's guitar culture and local guitar-related resources. This task's scope may include performances, lessons, shops, social networking, gigs, as well as other aspects the user discovers in the process of searching (e.g., perhaps he discovers that a local museum features an exhibit of musical instruments). Open-ended search tasks often entail a significant amount of manual navigation, as part of an extended process of exploration, discovery, and task/query refinement known as berrypicking [2] or information foraging [29].

3. Preference for orienteering: At times, users prefer to navigate rather than "teleport" to a target document, because doing so enables them to understand the surrounding context. This behavior is called orienteering [36].

Despite the prevalence of navigation as a search tactic, the conventional (hyper)text retrieval paradigm focuses uniquely on querying. In this paper we treat the combination of querying and navigation as the unit of interest, in which querying merely identifies starting points for navigation, and navigation is guided based on the user's query.

Categories and Subject Descriptors: H.3 [Information Systems]: Information Storage and Retrieval

General Terms: Algorithms, Design, Experimentation

Keywords: Web search, Navigation, Browsing, Link analysis, Underspecified search tasks

1.
INTRODUCTION

While performing search tasks¹ in hypertext environments such as the World Wide Web, users often supplement querying with extensive manual navigation (i.e., traversal of

¹ A search task is the quest for information by a person with insufficient knowledge to solve a certain problem at hand [3].

1.1 Navigation-Aided Retrieval

We propose a new hypertext retrieval paradigm that incorporates post-query user navigation as an explicit component, called navigation-aided retrieval (NAR). With NAR, as with the conventional paradigm, users submit freely-chosen keyword queries and are presented with a ranked list of documents. Unlike with the conventional paradigm, the documents in the list represent starting points from which the user can commence exploration. Loosely speaking, good

Copyright is held by the International World Wide Web Conference Committee (IW3C2). Distribution of these papers is limited to classroom use, and personal use by others. WWW 2007, May 8-12, 2007, Banff, Alberta, Canada. ACM 978-1-59593-654-7/07/0005.

Figure 1: Screenshots of our prototype navigation-aided retrieval system, for query "guitar" over Craigslist Pittsburgh.

starting points are hypertext documents that, while they may not match the user's query directly, permit easy navigation to many documents that do match the query, via one or more outgoing hyperlink paths (our starting points differ from Kleinberg's "hubs"; see Section 2.1). Figure 1 shows the output of our prototype NAR system, called Volant, for the query "guitar" over a community bulletin-board Web site called Craigslist Pittsburgh². The upper screenshot shows the initial response page (list of starting points); the other three show sample content from each of the top three starting points. Notice how the list of starting points neatly categorizes the relevant information available on the site.
This categorization is a consequence of the fact that the Web site is appropriately organized, combined with the navigation-aware design of the scoring function Volant uses to identify and rank starting points. The model underlying the scoring function assumes the user has a certain propensity to navigate outward from the initial query results, and that navigation is directed based on the user's search task. Importantly, our navigation-aided retrieval model strictly generalizes the conventional probabilistic information retrieval model, which implicitly assumes no propensity to navigate (formal details are provided in Section 3).

Coming back to Figure 1, notice that certain hyperlinks are highlighted (i.e., they have a shaded background). Volant assists users as they navigate outward from the starting points by highlighting hyperlinks that lead to one or more documents matching the query ("guitar," in this case), following [20, 28]. In general, we refer to post-query hyperlink annotation as navigation guidance, which is the second key aspect of the NAR paradigm.

In NAR systems, the user may revise his query while navigating. Doing so causes the navigation guidance annotations (e.g., link highlighting) to update in place, without changing the content on display.³ The user can therefore refine/reformulate his query incrementally, without losing his bearings.

To summarize, in light of the fact that users tend to supplement querying with navigation, navigation-aided retrieval ensures that querying and navigation work together, rather than interfering with each other: The initial query takes the user to document neighborhoods that are both relevant and navigable. Once there, the user can navigate while retaining relevance indicators created by querying. Conversely, he can re-query while retaining the context arrived at by navigation.

1.2 Organic versus Synthetic Structure

The importance of supplying context with query results is well recognized [15].
Most prior work along these lines aims to synthesize some sort of structure automatically when a query arrives, to serve as a contextual backdrop for query results and provide semantically meaningful avenues for exploration. Examples include query result clustering [40, 41] and "faceted" query/navigation interfaces [16, 33]. These approaches can be characterized as navigation-aided retrieval with synthetic structure (Synthetic NAR).

² http://pittsburgh.craigslist.org
³ The user can of course request a fresh list of starting points instead, if desired.

In this paper we are primarily interested in navigation-aided retrieval using organic structure, i.e., structure that is naturally present in pre-existing hypermedia documents (Organic NAR). Synthetic and Organic NAR each offer certain advantages. Clearly, Synthetic NAR is more flexible, and does not rely on the a priori existence of suitable starting point documents. The principal advantages we perceive for Organic NAR are:

· Human oversight. Working with pre-existing structure ensures that a human oversees the way information is organized. Therefore it is more likely that categories make sense, have proper labels, and that each category has information organized in a useful way (e.g., Craigslist postings are sorted by date).

· Familiar user interface. In our paradigm, the user interface departs only slightly from the one users are already familiar with (i.e., queries return a simple, flat list of links to documents). Interface complexity can be dangerous: a user study evaluating a kind of faceted search interface [33] found that the interface, which displays several types of information views and navigation paths simultaneously, required significant learning time, which hampered effectiveness.

· A single view of the document collection.
Unlike with faceted search interfaces, with Organic NAR users only have to contend with one "view" of the document collection, and content is presented in the same way each time they visit (modulo navigation guidance). Hence users are more easily able to retrace their steps, a property that is crucial for orienteering.

· Robust implementation by a third party. Our paradigm based on exploiting pre-existing structure requires no semantic knowledge. Consequently, it is relatively easy to achieve a robust implementation. Moreover, the retrieval service can be provided by a third party that does not own or understand the corpus.

(A preliminary empirical comparison of Organic and Synthetic NAR is presented in Section 5.5.)

2.1 Selecting Starting Points

Best Trails [39] is a retrieval system that selects starting points in response to queries; however, it restricts them to be documents matching the query. Its proposed scoring function is ad hoc in nature and does not take into account navigability factors, which are central to our work. Further, its user interface departs substantially from the traditional query/browse interface and is difficult to use (as reported by the user study in [39]). In contrast, our prototype Organic NAR system closely adheres to the familiar interfaces offered by popular query and browse tools.

The hypertext retrieval paradigm known as topic distillation [5, 7, 8] bears some similarity to NAR. Topic distillation aims to identify a small number of high-quality documents that are representative of a broad topic area (e.g., bicycling). Much of the work to date on topic distillation uses as a primitive the HITS algorithm [22], which identifies bipartite subgraphs consisting of hubs (related to our notion of starting points) and authorities. Under the HITS model, good hubs are those that have many links to good authorities. Symmetrically, good authorities are those that are linked to by many good hubs.
HITS-based approaches are inherently effective only for broad topic areas for which there are many hubs and authorities. NAR does not share this limitation, and in principle can accommodate arbitrarily narrow search tasks (see Section 5.3). Furthermore, a NAR starting point is more general than a HITS hub in the following ways:

· A starting point may be multiple links away from documents matching the query.

· It is easy to navigate outward from a starting point (i.e., the link anchor texts are informative and indicative of the documents reachable via the link).

There exist specific approaches [25, 6] to address each of these concerns, but none of them is as general as NAR.

The work of [21] aims to identify nodes of maximum influence in a social network, where the definition of influential nodes is related to our notion of good starting points. As with all prior work we are aware of, a key difference is that our work takes navigability factors into account when evaluating potential starting points.

1.3 Contributions

The contributions of this paper are:

· A formal model of navigation-aided retrieval (Section 3).
· Implementation techniques for a NAR-based retrieval system (Section 4).
· Empirical evaluation via a user study (Section 5).

Related work is discussed next.

2.2 Guiding Navigation

WebWatcher [20] highlights hyperlinks along paths taken by previous users who had posed similar queries. This approach may not be suitable for open-ended search tasks where the query is not a good representation of the underlying search task. Letizia [24] and Personal WebWatcher [26] do not rely on queries but instead passively observe the user's browsing behavior in order to learn a model of his search task, and highlight links that match the inferred task. This approach can in principle be incorporated into a NAR system, to augment guidance based on explicit queries.
Highlighting of hyperlinks based on an explicit query, as a method of navigation guidance, was evaluated via a user study in our previous work [28]. Guided navigation was found to result in significantly faster completion of certain search tasks compared to traditional query and browsing interfaces, assuming the user already knows of a suitable starting point. Automatic identification of good starting points is the focus of the present paper.

2. RELATED WORK

We have already discussed work on presenting query results in the context of dynamically-synthesized structure, in Section 1.2. There is a relatively small body of work that shares our approach of leveraging pre-existing structure, which we discuss here. We consider two distinct subbodies in turn: (1) finding starting points for exploration, and (2) providing guidance during navigation.

Table 1: Table of symbols used in the NAR model

  Symbol | Meaning
  D      | set of documents in the corpus
  T      | user's search task
  S_T    | answer set for search task T, i.e., the set of documents any of which is sufficient to complete the search task T
  Q_T    | set of valid queries for search task T, i.e., the (possibly infinite) set of queries that might be posed by a user performing T

Our goal is to formulate a scoring function such that document d is given a score equal to the probability that the user completes his search task if he starts navigation at document d. We denote the scoring function as σ(d, q). Thus,

  σ(d, q) = Pr{user completes T | user starts navigating at d}
          = Σ_{d' ∈ D} Pr{d ⇝ d', d' ∈ S_T | q ∈ Q_T}
          ≈ Σ_{d' ∈ D} Pr{d' ∈ S_T | q ∈ Q_T} × Pr{d ⇝ d' | d' ∈ S_T, q ∈ Q_T}

3. NAVIGATION-AIDED RETRIEVAL MODEL

A navigation-aided retrieval system takes as input a user query, and returns one of: (1) a synthetic starting point document (in the case of a Synthetic NAR system), or (2) an ordered list of links to organic starting points (in the case of an Organic NAR system).
Subsequent navigation on the part of the user is performed under guidance from the retrieval system based on the user's query. In this paper we focus on Organic NAR. Our aim is to develop a numeric scoring function for ranking organic starting point documents. We first supply an abstract formulation (Section 3.1), and then explore concrete instantiations (Section 3.2).

3.1 Generic Model

We now describe the basic framework and terminology underlying the NAR model. A quick reference for the symbols used is provided in Table 1. Let D be the set of documents in the searchable corpus. Let T denote the user's underlying search task, which is unknown to the retrieval system (and perhaps not fully known to the user himself at the outset). We assume the user is able to complete a search task by viewing exactly one document⁴ in D. The answer set (S_T) for a search task T is defined as the set of all documents, any one of which suffices to complete the search task T. For a given search task T, a valid query is one which might be posed by a user performing T. The (potentially infinite) set of all valid queries is denoted by Q_T = {q1, q2, ...}.

⁴ In general, a search task might require visiting multiple pages for completion; we believe our model can be extended to handle such situations.

The NAR scoring function is based on two distinct "submodels" that account for query relevance and user navigation, respectively:

Query submodel. A query submodel provides a belief distribution for the answer set S_T, given that the user reveals a query q to be a member of Q_T. It is used to estimate the likelihood that a particular document d' results in completion of the task, i.e., Pr{d' ∈ S_T | q ∈ Q_T}.

Navigation submodel. A navigation submodel assesses the likelihood with which a user starting from document d is able to navigate to a document d' ∈ S_T, thereby successfully completing the search task. To be precise, let d ⇝ d' denote the event that a user performing the search task T navigates successfully from document d to document d', under guidance G(q) provided for query q. A navigation submodel is used to estimate Pr{d ⇝ d' | d' ∈ S_T, q ∈ Q_T}.

3.2 Instantiations of the Generic Model

We supply two instantiations of the generic NAR model given in Section 3.1: the conventional probabilistic retrieval model, which does not anticipate navigation outward from query results (Section 3.2.1), and a new approach that incorporates a model of possible post-query user navigation actions (Section 3.2.2).

3.2.1 Conventional Probabilistic IR Model

Our generic NAR model reduces to the classical probabilistic IR model if we apply the following assumption:

Conventional IR assumption. The user has no propensity to navigate outward from any of the retrieved documents, i.e.,

  Pr{d ⇝ d' | d' ∈ S_T, q ∈ Q_T} = 1 if d = d', and 0 otherwise.

Under this assumption, the NAR scoring metric reduces to:

  σ_c(d, q) = Pr{d ∈ S_T | q ∈ Q_T}

Recall that Pr{d ∈ S_T | q ∈ Q_T} represents the probability that document d is relevant to the search task underlying query q. Several methods have been proposed in the probabilistic IR literature for estimating this term, e.g., [23, 35].

3.2.2 Navigation-Conscious Model

We now present an instantiation of the NAR model which takes into account the possibility of the user navigating further from the results retrieved for a query. We first instantiate the two submodels and then derive a formula for the scoring function under these submodels.

Query submodel. Any probabilistic IR relevance ranking function can be used to estimate the term Pr{d' ∈ S_T | q ∈ Q_T}. We refer to such an estimate as R(d', q).

Navigation submodel. As mentioned before, the navigation submodel estimates the probability of a user browsing from a given document to another document while performing a search task. To estimate this quantity, we adopt the stochastic model of user navigation behavior called WUFIS proposed by Chi et al. [9, 10].
WUFIS is based on the theory of information scent [29], and has been validated against real user traces [10]. At the heart of WUFIS lies a function W(N, d1, d2), which gives the probability that a user whose search task is characterized by information need N navigates successfully along some path from document d1 to document d2 (N is encoded as a term vector). Every hyperlink, via its anchor and surrounding context, provides some information scent regarding the content reachable by clicking that hyperlink. WUFIS assumes that the probability of a user following a hyperlink depends on how well his information need matches the information scent provided by the hyperlink. The information scent is approximated by a weighted vector of terms occurring in the anchor text and optionally in a small window around the hyperlink. The cosine similarity between the information need vector and the information scent vector is used to estimate the probability that the user follows the given hyperlink, along with an assumed attrition probability between 0 and 1 (set to 0.85 in our experiments). The navigation probability between two documents d1 and d2 is given by the product of link navigation probabilities along the sequence of links connecting d1 to d2.

We propose to use WUFIS as the basis for our navigation submodel.⁵ Constructing an estimate of N from a document in the answer set S_T is a challenge in itself, but rather orthogonal to the focus of our work. Hence, we take the naive approach of approximating N by extracting the first k terms from the document. Better ways of extracting the information need will likely lead to an improvement in performance. Let N(d') denote the information need extracted from some d' assumed to be a member of S_T. We instantiate our navigation submodel by setting Pr{d ⇝ d' | d' ∈ S_T, q ∈ Q_T} = W(N(d'), d, d').

Scoring function.
When we combine our query submodel with our navigation submodel, we arrive at the following starting point scoring function:

  σ_n(d, q) = Σ_{d' ∈ D} R(d', q) × W(N(d'), d, d')    (1)

The above formula makes intuitive sense: the two terms embody the two key factors in assessing the likelihood of successful task completion, loosely speaking:

1. the number of documents reachable from d that are relevant to the search task, and
2. the ease and accuracy with which the user is able to navigate to those documents.

⁵ Here we ignore the influence of navigation guidance, because we have no reliable basis for assuming any particular effect. As future work we plan to extend WUFIS to account for the additional "scent" provided by navigation guidance.

4. VOLANT: A PROTOTYPE ORGANIC NAR SYSTEM

We built a prototype Organic NAR system called Volant, which implements the navigation-conscious model instantiation described in Section 3.2.2. In this section, we describe its design and implementation.

4.1 Overview

Figure 2 provides an overview of Volant's architecture. Volant consists of three components: content engine, connectivity engine, and intermediary. The content engine supports retrieval and ranking of Web documents via a text index. The connectivity engine maintains a specialized index of linkage between pairs of documents. The intermediary intercepts HTTP requests from the user and communicates with the connectivity and content engines, as well as the WWW, in order to produce the desired response: either starting points or Web pages enhanced with navigation guidance.

Implementation of a NAR system is not the focus of this paper, and so we relied heavily on third-party software to build Volant. The Lucene text indexing and retrieval system served as Volant's content engine. The connectivity server was built on top of a MySQL database [27], while the intermediary was coded using Java servlets running on an Apache Tomcat server [37]. Exploring more efficient implementation options (e.g., the connectivity server in [4]) would be valuable future work.

Next, in Section 4.2, we describe the index structures Volant creates offline by preprocessing the corpus. Then, we describe how Volant selects starting points in response to keyword queries (Section 4.3) and adds navigation guidance to subsequent Web pages that the user visits (Section 4.4). We discuss performance and scalability issues briefly in Section 4.5.

4.2 Preprocessing

4.2.1 Content Engine

As mentioned before, Lucene serves as Volant's content engine, and is used to estimate the value of R(d, q), the probability that document d is relevant to the query q. We are not aware of any ranking function which can compute the exact probability of a document being relevant to a query. However, certain probabilistic models, which produce a score approximately proportional to the logarithm of the relevance probability, have proved to be effective for retrieval. We follow the state of the art and use the Okapi BM25 scoring function [32] to estimate R(d, q). This approximation can lead to suboptimal results; however, formulating a scoring function which produces exact relevance probabilities is not the focus of our work.

Lucene only supports scoring functions based on the Vector Space Model [1]. We incorporated the Okapi BM25 function into Lucene's source code for document ranking. The precise term weighting formula we used is:

  w_{t,d} = ( tf_{t,d} × log((N - n + 0.5) / (n + 0.5)) ) / ( k1 × ((1 - b) + b × dl/avdl) + tf_{t,d} )

where

  w_{t,d}  = weight of term t in document d
  tf_{t,d} = frequency of term t in document d
  N    = total number of documents in the corpus
  n    = number of documents matching term t
  dl   = length of document d
  avdl = average length of a document in the corpus
  b, k1 = tuning parameters

We set b = 0.75 and k1 = 2, since these values have been used effectively by other retrieval systems [13, 18, 31].
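As a concrete illustration, the term weighting formula above translates directly into code. The sketch below is our own; the function and variable names are hypothetical and do not come from Volant or Lucene:

```python
import math

def bm25_weight(tf, df, N, dl, avdl, k1=2.0, b=0.75):
    """Okapi BM25 weight of one term in one document, following the
    formula above: tf times idf, divided by (k1 * length normalization + tf).
    df is the number of documents containing the term ("n" above)."""
    idf = math.log((N - df + 0.5) / (df + 0.5))
    return (tf * idf) / (k1 * ((1 - b) + b * dl / avdl) + tf)

def bm25_score(query_terms, doc_tf, doc_len, df, N, avdl):
    """Score a document against a query by summing per-term BM25 weights.
    doc_tf and df are dicts mapping terms to frequencies."""
    return sum(
        bm25_weight(doc_tf[t], df[t], N, doc_len, avdl)
        for t in query_terms if t in doc_tf
    )
```

Note the saturation behavior: because tf appears in both numerator and denominator, doubling a term's frequency contributes less than double the weight, which is the key difference from plain tf-idf.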
4.2.2 Connectivity Engine

The connectivity engine is used to estimate the value of W(N(d2), d1, d2), i.e., the probability of a user with information need N(d2) successfully navigating from d1 to d2 under the WUFIS model. Let W_P(N(d2), d1, d2) denote the probability of a user with information need N(d2) successfully navigating from d1 to d2 along a particular path P, and let P* denote the path for which this value is maximized. In order to keep the computation feasible, we assume that P* is the only path connecting d1 to d2, ignoring other "low-scent" paths between d1 and d2. Let d_W denote the document immediately following d1 along P*.

In the preprocessing stage, for each ordered pair ⟨d1, d2⟩ of documents in the corpus, a variant of Dijkstra's algorithm [11] is used to generate tuples of the form ⟨d1, d2, d_W, W(N(d2), d1, d2)⟩. These tuples are then indexed so as to optimize lookups based on d2 (in our case, we build database indexes on appropriate columns of our MySQL tables).

Figure 2: Volant architecture (overview appears above the horizontal line; details appear below).

4.5 Efficiency and Scalability

The primary aim of this paper is to introduce the NAR model and evaluate its effectiveness (our next topic, in Section 5). That said, we discuss efficiency and scalability issues here briefly. We have used Volant with a corpus of over 1.5 million documents and over eight million hyperlinks (the .GOV collection; see Section 5). Once the content and connectivity indexes have been created, the online components (starting point selection and link highlighting) execute at interactive speeds. Not surprisingly, however, the preprocessing stage (Section 4.2) is problematic in terms of scalability.
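The per-need computation in the preprocessing pass can be sketched as follows. This is our own minimal illustration, not the paper's implementation: it assumes the per-link following probabilities (scent similarity times the attrition factor) have already been computed for a fixed information need, and all names are hypothetical. Maximizing a product of per-link probabilities is equivalent to a shortest-path search under -log(p) edge weights, which is why a Dijkstra variant applies:

```python
import heapq
import math

def best_path_probabilities(links, source):
    """links[u] maps document u to {v: p}, where p is the probability of
    following the link u -> v under a fixed information need.  Returns two
    dicts: for every document reachable from `source`, the probability of
    the single highest-probability path from `source`, and the first hop
    on that path (the analogue of d_W above)."""
    best = {source: 1.0}   # best known path probability per node
    first_hop = {}
    heap = [(0.0, source, None)]  # (-log of path prob, node, first hop)
    while heap:
        cost, u, hop = heapq.heappop(heap)
        if cost > -math.log(best[u]) + 1e-12:
            continue  # stale heap entry; a better path was found already
        for v, p in links.get(u, {}).items():
            prob = best[u] * p
            if prob > best.get(v, 0.0):
                best[v] = prob
                # paths out of the source take v itself as the first hop
                first_hop[v] = hop if hop is not None else v
                heapq.heappush(heap, (-math.log(prob), v, first_hop[v]))
    return best, first_hop
```

Running this once per document (or per target over a reversed link graph) yields the ⟨d1, d2, d_W, W⟩ tuples described above, and also makes the scalability problem visible: the work is inherently quadratic in the number of documents.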
In particular, creation of the connectivity index requires listing all pairs of documents. For a very large corpus it becomes infeasible to consider all such pairs. Devising principled techniques to limit the number of document pairs considered, based on the likelihood of finding paths with strong scent, is an important topic for future work.

4.3 Selecting Starting Points

At runtime, when a query q is received, Volant constructs a ranked list of starting points as follows:

1. Retrieve from the content engine all documents d' for which R(d', q) > 0.
2. For each document d' retrieved in Step 1, retrieve from the connectivity engine all documents d for which W(N(d'), d, d') > 0.
3. For each unique document d identified in Step 2, compute the starting point score σ_n(d, q).
4. Sort the documents in decreasing order of σ_n(d, q); truncate after the top k documents.

5. EVALUATION

Having presented our formal navigation-aided retrieval model and an overview of our implementation techniques, we turn now to evaluation.

5.1 Experimental Hypotheses

In addition to evaluating Volant's effectiveness in combined query/navigation settings, we seek to study Volant's behavior in scenarios that do not entail navigation. Also of interest is the effectiveness of Organic NAR approaches (such as Volant) relative to that of Synthetic NAR approaches. Ideal experiments would test the following three hypotheses:

· Hypothesis 1: In query-only scenarios, Volant does not perform significantly worse than conventional retrieval approaches.

· Hypothesis 2: In combined query/navigation scenarios, Volant selects high-quality starting points. Quality is further enhanced by query-driven navigation guidance.
· Hypothesis 3: In a significant fraction of query/navigation scenarios, the best organic starting point is of higher quality than one that can be synthesized using existing techniques.

The experiments we present in this paper represent preliminary efforts toward validating the above hypotheses. Before we describe our experiments, we introduce the search tasks we used as test sets.

4.4 Adding Navigation Guidance

Given query q, and document d requested by the user, Volant intercepts and modifies d by highlighting all links on d that lead to documents relevant to q. The procedure for determining which links on d to highlight is as follows:

1. Retrieve from the content engine all documents d' for which R(d', q) ≥ T, where T > 0 is a fixed relevance threshold (we used T = 0.1 in our experiments).
2. For each document d' retrieved in Step 1, retrieve from the connectivity engine all tuples where W(N(d'), d, d') > 0.
3. For each ⟨d, d', d_W, W(N(d'), d, d')⟩ tuple retrieved in Step 2, highlight links on d that point to d_W.

Table 2: Performance on unambiguous queries.

  BM25 MAP score | Volant MAP score | p-value
  0.23           | 0.20             | 0.22

5.4 Performance on Ambiguous Queries

Our main experiment tests Hypothesis 2 (Section 5.1), i.e., how well Volant performs in the scenarios for which it was conceived. This experiment uses the ambiguous test set (Section 5.2) to measure the usefulness of starting points provided by Volant in performing a search task via navigation. Unfortunately, the relevance judgments provided by TREC are not suitable for this experiment, because they merely identify documents that constitute the root of a relevant subsite (or some other form of structural hub); they provide no indication of how useful each such hub would be in performing the search task. Furthermore, we cannot test the efficacy of navigation guidance (which annotates documents in a query-specific manner) using the TREC judgments. Therefore, we ourselves designed and conducted a user study in order to validate Hypothesis 2.

5.2 Search Task Test Sets

It is not immediately clear how one would identify navigation-prone scenarios a priori, given a search task description or query. As a proxy we rely on the notion of query clarity [14], which measures query ambiguity (a high clarity score indicates low ambiguity). Query clarity has been used to predict retrieval effectiveness. Queries for which retrieval is ineffective are likely to be followed by query refinement and/or navigation (which may be thought of as implicit refinement). Hence we posit that low query clarity serves as a reasonable first-order indicator of navigation-prone scenarios. We used the Simplified Clarity Score (SCS) metric, which has been shown to predict retrieval effectiveness well in the case of short queries [19], to guide our selection of two sets of search tasks and corresponding queries:

· Unambiguous: The 20 search tasks of highest clarity from the TREC 2000 Web track⁶ [17], using the "title" field as the query. (Average clarity of top-20 = 9.6.) These tasks are over the TREC WT10G corpus.

· Ambiguous: 48 randomly-selected tasks from the TREC 2003 Web topic distillation track [38], which is geared toward open-ended exploration of broad topics, again using the "title" field as the query. (Average clarity = 4.5.) These search tasks are over the TREC .GOV corpus.

The WT10G and .GOV test collections have been shown to be structurally representative of the real Web [34]. Note that the search tasks in the topic distillation track are almost certain to be broad, open-ended and ambiguous in nature; we just use the clarity score as added support for our belief.
To prevent misclassifications in the Web track search task set, we manually screened the search tasks and retained only those that were clearly unambiguous in nature. In our experiments, we use the unambiguous test set to represent "query-only" scenarios, and the ambiguous test set for "combined query/navigation" scenarios.

5.4.1 User Study Design
Participants. The study comprised 48 human judges, all students at CMU who were familiar with the Web and Web searching but had no knowledge of our project. Each participant was assigned a search task (selected at random without replacement) and asked to judge the suitability of various documents as starting points for this search task (a within-subjects design). A given starting point was judged by exactly one participant.
Evaluation criteria. Participants were encouraged to explore the neighborhood of the starting point while forming their assessment. Ratings were taken for four criteria, each on a scale from 1 to 5:
· Breadth: Would a broad spectrum of people, who are interested in different aspects of the search task, be satisfied with the information obtainable via the starting point?
· Accessibility: Given a user with a particular specialized interest, would he be able to navigate easily from the starting point to material that suits his interest?
· Appeal: Do you like the way the material is organized and presented?7
· Usefulness: Would most people be able to complete their task successfully using this starting point?
The breadth, accessibility and appeal criteria focus on specific factors, whereas the usefulness criterion is intended to elicit an overall assessment.
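The per-criterion ratings are later averaged with 95% confidence intervals (Figure 3). A minimal sketch of that aggregation; the paper does not state its exact interval method, so the normal approximation below is only an illustrative assumption:

```python
import math

def mean_with_ci(ratings, z=1.96):
    """Mean of a list of 1-5 ratings with an approximate 95% confidence
    interval half-width (normal approximation, z = 1.96).

    Returns (mean, half_width); the interval is mean +/- half_width.
    """
    n = len(ratings)
    mean = sum(ratings) / n
    var = sum((r - mean) ** 2 for r in ratings) / (n - 1)  # sample variance
    return mean, z * math.sqrt(var / n)
```

For example, five ratings [3, 4, 5, 4, 4] give a mean of 4.0 with a half-width of about 0.62.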
In this experiment, to maximize the information gathered from the human subjects, we configured Volant to prune redundant starting points in the following way: documents reachable (with high navigation probability) from the top-ranked starting point are factored out when selecting the second-ranked starting point, and so on. In a real-world deployment it may be desirable to prune query results in such a manner, and indeed to enforce diversity in a more general sense8. An in-depth study of methods for ensuring diversity in NAR query results is beyond the scope of this paper, and is left as future work.

5.3 Performance on Unambiguous Queries
As a test of Hypothesis 1 (Section 5.1), we measured mean average precision (MAP) scores on the unambiguous test set (Section 5.2). Results are reported in Table 2. Performance did not differ significantly. Indeed, for these queries the ranked query result lists produced by Volant were nearly identical to those produced by BM25. The reason is that relevant documents tended not to be siblings or close cousins in the linkage structure. Consequently, Volant deemed that the best starting points were the relevant documents themselves or, more precisely, the documents estimated to have the highest chance of being relevant according to BM25.

6 Ad-hoc retrieval tasks with varying degrees of ambiguity.
7 Based on a posteriori discussions with some of the subjects, we determined that appeal may have been influenced by the fact that documents lacked the embedded images that would normally have been present (images were not included in the corpus used for the TREC 2003 Web topic distillation track).
8 Most popular commercial search engines already provide such features.
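The greedy redundancy pruning described in the study design above (factoring out documents reachable from higher-ranked starting points) can be sketched as follows; `reachable` is a hypothetical stand-in for Volant's navigation-probability computation:

```python
def prune_redundant(ranked_candidates, reachable, k):
    """Select up to k starting points, greedily skipping redundant ones.

    ranked_candidates : candidate starting points, best first
    reachable         : dict mapping a page to the set of pages reachable
                        from it with high navigation probability
    """
    selected, covered = [], set()
    for c in ranked_candidates:
        if c in covered:
            continue  # already reachable from an earlier starting point
        selected.append(c)
        covered |= reachable.get(c, set()) | {c}
        if len(selected) == k:
            break
    return selected
```

For instance, if `b` is reachable from `a`, then `prune_redundant(["a", "b", "c", "d"], {"a": {"b"}}, 3)` skips `b` and returns `["a", "c", "d"]`.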
Figure 3: Performance on ambiguous queries: starting point ratings averaged across all tasks/participants (with 95% confidence intervals). [Four panels: Usefulness, Breadth, Accessibility and Appeal, each showing Volant-G, Volant-U and TD at Top-1, Top-5 and Top-10.]

Starting points evaluated. For each search task, the following sets of starting points were included in the evaluation. First, we included the top ten starting points produced by Volant (these have hyperlink highlighting). We also included a copy of this top-10 set with link highlighting removed, to allow us to isolate the impact of navigation guidance in our measurements. These 20 starting points constitute Organic NAR results produced by our methods. Since there is no prior work on Organic NAR, we added the top ten results of the best-performing algorithm in the TREC 2003 Web topic distillation track [12]. This algorithm, developed by CSIRO, is a fine-tuned implementation of a sophisticated topic distillation technique, making use of a variety of document features such as URL length, anchor text, off-site/on-site in-degrees and out-degrees, etc. The total of 30 starting points thus obtained were randomly ordered before presentation, and participants were not told how the starting points were selected.

Effort. Each participant was asked to judge a total of 30 starting points. Since a participant was expected to explore the neighborhood of a starting point before making an assessment, we allocated ten minutes per starting point, which meant that each participant devoted 300 minutes (five hours) to this study. To prevent fatigue, we distributed the study over five days, wherein each participant was required to spend only an hour each day judging starting points.
Physical constraints (rooms, study administrators, etc.) limited the number of participants operating simultaneously in a single session to ten. Therefore, we conducted five sessions of ten users each per day, over a period of five days. This study involved a substantial investment of time, effort and money. The number of participants (and hence search tasks), the number of results considered for each algorithm, and the number of algorithms compared were severely restricted by our resources. The decision to use only the top ten results of each algorithm allowed us to test several algorithms/variants without excessive burden. We feel this decision is reasonable, given that users of commercial search engines typically view only the top ten results.

5.4.2 Results
Figure 3 shows the starting point ratings, averaged across all search tasks (and hence across all participants). Each metric is plotted on a separate graph. In each graph, the leftmost group shows the results for the top-ranked starting points; the center group shows the average across the top five starting points; the rightmost group shows the average across the top ten. The label "Volant-G" corresponds to Volant with navigation guidance; "Volant-U" corresponds to Volant with no guidance; "TD" corresponds to the topic distillation algorithm9. We used the paired t-test to identify differences that carry statistical significance (using p = 0.05 as the cutoff). In no case does TD perform better than either Volant-U or Volant-G with statistical significance. There are cases in which the converse is true, but not across all three rank groups.

9 Contrary to what one would expect, Volant-G and Volant-U did not receive identical breadth ratings (the difference was statistically significant for Top-5 and Top-10). We attribute this difference to a discrepancy in perceived breadth.
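The significance testing above is a standard paired t-test over per-task rating differences. A self-contained sketch of the test statistic (toy numbers, not the study's data); the resulting |t| would then be compared against the critical value for n−1 degrees of freedom at the p = 0.05 cutoff:

```python
import math

def paired_t_statistic(xs, ys):
    """t statistic for a paired t-test on two matched samples.

    Each (x, y) pair holds two systems' ratings on the same task; the test
    asks whether the mean per-task difference differs from zero.
    """
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)
```

For the toy ratings xs = [4, 5, 3, 4, 5] and ys = [3, 4, 3, 3, 4], the statistic is t = 4.0.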
The only statistically significant difference across all three rank groups is that Volant-G consistently outperforms each of Volant-U and TD in terms of usefulness.10

5.4.3 Conclusions
First, this experiment confirms the usefulness of navigation guidance. (Determining the particular form of navigation guidance that works best is left as future work.) Second, in the absence of navigation guidance Volant performs on par with the best known topic distillation algorithm, even though we performed almost no tuning of Volant prior to the experiment. While the two algorithms have similar performance, Volant offers three important advantages:
· Unlike the topic distillation algorithm, which uses a combination of heuristics and has been tuned extensively, Volant is based on a clean theoretical model.
· Volant invokes a conventional retrieval function as a subroutine, so advances in conventional retrieval technology are trivial to incorporate into Volant to achieve a corresponding improvement.
· As indicated by our first experiment (Section 5.3), Volant is suitable for unambiguous queries as well as ambiguous ones. This property may obviate the need to choose between two retrieval algorithms (e.g., conventional IR and topic distillation) for each query.

5.5 Organic versus Synthetic NAR
To obtain a preliminary understanding of the relationship between Organic and Synthetic NAR, we injected a synthetic starting point document into the set of starting points evaluated by each of our human subjects. We used the well-known search result clustering algorithm of Zamir and Etzioni [40] to generate synthetic starting points. Each cluster produced by the algorithm is treated as a starting point, with links to the documents belonging to that cluster. Our results so far are inconclusive. On the one hand, if we compare the top-ranked starting point suggested by Volant against the synthetic starting point generated by the clustering algorithm, we find that for breadth, accessibility and appeal, clustering outperforms Volant by a statistically significant margin. A significant difference for usefulness was not found. Examining usefulness on a task-by-task basis, we find that the clustering starting point was rated more useful in 40% of the search tasks, while Volant's top starting point was rated more useful in 17% of the search tasks (in 43% of the search tasks there was a tie).11 On the other hand, if we broaden our focus to include all available ratings, we find that for at least 44% of the search tasks the corpus contains a document that, when navigation guidance is added, is considered more useful than the clustering starting point. If we also include documents that were rated as equally useful, that figure rises to 96%.

6. SUMMARY AND FUTURE WORK
This paper introduces a new document retrieval paradigm called navigation-aided retrieval (NAR), designed to handle poorly-specified and open-ended search tasks in hypertext environments. These situations usually require a mix of querying and navigation on the part of the user. As such, NAR aims to integrate these two complementary search tactics, which to date have primarily been treated separately.

Effectiveness. The salient properties of the NAR paradigm are: (1) responding to queries by positioning users at suitable starting points for exploratory navigation; and (2) helping to guide navigation in a query-driven fashion. Our experiments showed that the scoring function implied by our NAR model performs as well as the best-known topic distillation algorithm at selecting suitable starting points, and that our navigation guidance mechanism is of significant benefit compared with no guidance.

Relationship to conventional IR. Our formal model of navigation-aided retrieval constitutes a strict generalization of the conventional probabilistic IR model. Furthermore, our empirical work suggests that in the case of unambiguous queries for which conventional IR techniques are sufficient, NAR reduces to standard IR automatically. This property, if confirmed through further experiments, would obviate the need to choose from two alternative retrieval methods based on the nature of the search task.

Relationship to synthetic approaches. Search result clustering can be thought of as a form of navigation-aided retrieval, as the user is expected to navigate within the dynamically constructed topic hierarchy before converging on the desired document(s). We characterize methods of this form as Synthetic NAR, whereas our method can be thought of as Organic NAR. Although Synthetic NAR has the obvious advantage of being unconstrained by pre-existing structure, in our (albeit preliminary) experiments we found that Organic NAR has significant potential. Specifically, in nearly all cases (96%) the best organic starting point was rated to be at least as useful as the topic hierarchy created by a well-regarded clustering algorithm. Furthermore, in a significant fraction of cases (44%) the best organic starting point received a higher rating than the clustering result.

10 The p-values for Volant-G versus Volant-U are all below 0.005 (highly significant); for Volant-G versus TD, all are below 0.005 (again, highly significant) except for Top-1, where p = 0.04 (borderline significant).
11 It is unknown whether these cases arise due to characteristics of the search task, the subject's personal tastes, Volant's manner of ranking starting points or, more likely, some combination of factors.

Future work.
While these initial results are encouraging, there is more work to be done. First, to perform well over corpora that contain substantial redundancy, it may be appropriate to generate only topically diverse starting points. Second, we would like to refine our scoring function so that it can be applied to synthetic starting points as well as organic ones, while still yielding good predictions. Ultimately, we envision a unified method that presents the user with directly relevant documents (in the case of unambiguous queries), organic starting points for exploration (in the case of ambiguous queries), or synthetic starting points (in the case that the corpus does not contain suitable organic ones).

Acknowledgements
We gratefully acknowledge the assistance of Gary Hsieh, Dhananjay Khaitan, Paul Ogilvie, Sandeep Pandey, Samuel Wang and Andrew Yang with the extensive implementation and evaluation effort that went into this project. We also thank Jamie Callan, Rosie Jones and Malcolm Slaney for helpful discussions and feedback.

7. REFERENCES
[1] R. A. Baeza-Yates and B. A. Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley, 1999.
[2] M. J. Bates. The Design of Browsing and Berrypicking Techniques for the On-line Search Interface. Online Review, 13:407–431, 1989.
[3] N. J. Belkin. Anomalous States of Knowledge as the Basis of Information Retrieval. Canadian Journal of Information Science, 5:133–143, 1980.
[4] K. Bharat, A. Broder, M. Henzinger, P. Kumar, and S. Venkatasubramanian. The Connectivity Server: Fast access to linkage information on the Web. In Proc. WWW, 1998.
[5] K. Bharat and M. Henzinger. Improved algorithms for topic distillation in a hyperlinked environment. In Proc. SIGIR, 1998.
[6] S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, P. Raghavan, and S. Rajagopalan. Automatic resource list compilation by analyzing hyperlink structure and associated text. In Proc. WWW, 1998.
[7] S. Chakrabarti, B. E. Dom, D. Gibson, R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Experiments in topic distillation. In Proc. SIGIR Workshop on Hypertext Information Retrieval on the Web, 1998.
[8] S. Chakrabarti, M. Joshi, and V. Tawde. Enhanced topic distillation using text, markup tags, and hyperlinks. In Proc. SIGIR, 2001.
[9] E. H. Chi, P. Pirolli, K. Chen, and J. Pitkow. Using information scent to model user information needs and actions and the Web. In Proc. SIGCHI, 2001.
[10] E. H. Chi, A. Rosien, G. Supattanasiri, A. Williams, C. Royer, C. Chow, E. Robles, B. Dalal, J. Chen, and S. Cousins. The Bloodhound project: Automating discovery of web usability issues using the InfoScent simulator. In Proc. SIGCHI, 2003.
[11] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. MIT Press/McGraw-Hill, second edition, 2001.
[12] N. Craswell, D. Hawking, A. McLean, T. Upstill, R. Wilkinson, and M. Wu. TREC12 web and interactive track at CSIRO, 2003.
[13] N. Craswell, D. Hawking, and S. Robertson. Effective Site Finding Using Link Anchor Information. In Proc. SIGIR, pages 250–257, 2001.
[14] S. Cronen-Townsend, Y. Zhou, and W. B. Croft. Predicting query performance. In Proc. SIGIR, 2002.
[15] S. Dumais, E. Cutrell, and H. Chen. Optimizing search by showing results in context. In Proc. SIGCHI, 2001.
[16] J. English, M. Hearst, R. Sinha, K. Swearingen, and K. Yee. Flexible Search and Browsing using Faceted Metadata. http://bailando.sims.berkeley.edu/papers/flamenco02.pdf, Jan. 2002.
[17] D. Hawking. Overview of the TREC-9 web track. In Proc. TREC, 2000.
[18] D. Hawking, T. Upstill, and N. Craswell. Toward better weighting of anchors. In Proc. SIGIR, 2004.
[19] B. He and I. Ounis. Inferring query performance using pre-retrieval predictors. In Proc. Symposium on String Processing and Information Retrieval, 2004.
[20] T. Joachims, D. Freitag, and T. Mitchell. WebWatcher: A tour guide for the World Wide Web. In Proc. International Joint Conference on Artificial Intelligence (IJCAI), 1997.
[21] D. Kempe, J. Kleinberg, and E. Tardos. Maximizing the spread of influence through a social network. In Proc. SIGKDD, 2003.
[22] J. M. Kleinberg. Authoritative sources in a hyperlinked environment. J. ACM, 46(5), 1999.
[23] V. Lavrenko and W. B. Croft. Relevance-based language models. In Proc. SIGIR, 2001.
[24] H. Lieberman. Letizia: An agent that assists Web browsing. In Proc. IJCAI, 1995.
[25] J. Miller, G. Rae, F. Schaefer, L. Ward, T. LoFaro, and A. Farahat. Modifications of Kleinberg's HITS algorithm using matrix exponentiation and web log records. In Proc. SIGIR, 2001.
[26] D. Mladenic. Using text learning to help Web browsing. In Proc. SIGCHI, 2001.
[27] MySQL DBMS. http://www.mysql.com/.
[28] C. Olston and E. H. Chi. ScentTrails: Integrating browsing and searching on the Web. ACM Trans. on Computer-Human Interaction, 10(3), 2003.
[29] P. Pirolli and S. Card. Information Foraging. Psychological Review, 1999.
[30] F. Qiu, Z. Liu, and J. Cho. Analysis of user web traffic with a focus on search activities. In Proc. International Workshop on the Web and Databases (WebDB), June 2005.
[31] S. Robertson, S. Walker, M. Hancock-Beaulieu, A. Gull, and M. Lau. Okapi at TREC. In Proc. TREC, 1992.
[32] S. E. Robertson, S. Walker, M. Hancock-Beaulieu, A. Gull, and M. Lau. Okapi at TREC. In Text REtrieval Conference, 1992.
[33] V. Sinha and D. R. Karger. Magnet: Supporting navigation in semistructured data environments. In Proc. SIGMOD, 2005.
[34] I. Soboroff. Do TREC Web Collections Look Like the Web? SIGIR Forum, pages 23–31, 2002.
[35] K. Sparck Jones, S. Walker, and S. Robertson. A probabilistic model of information retrieval: Development and comparative experiments. Information Processing and Management, 36, 2000.
[36] J. Teevan, C. Alvarado, M. S. Ackerman, and D. R. Karger. The perfect search engine is not enough: A study of orienteering behavior in directed search. In Proc. SIGCHI, 2004.
[37] Apache Tomcat. http://tomcat.apache.org/.
[38] TREC-2003 Web Track: Guidelines. http://es.csiro.au/TRECWeb/guidelines_2003.html.
[39] R. Wheeldon and M. Levene. The Best Trail algorithm for assisted navigation of Web sites. In Proc. LA-WEB Conference on Latin American Web Congress, 2003.
[40] O. Zamir and O. Etzioni. Web document clustering: A feasibility demonstration. In Proc. SIGIR, 1998.
[41] H.-J. Zeng, Q.-C. He, Z. Chen, W.-Y. Ma, and J. Ma. Learning to cluster web search results. In Proc. SIGIR, 2004.