Respect My Authority! HITS Without Hyperlinks, Utilizing Cluster-Based Language Models

Oren Kurland and Lillian Lee
Department of Computer Science, Cornell University, Ithaca, NY 14853-7501
kurland@cs.cornell.edu, llee@cs.cornell.edu

ABSTRACT

We present an approach to improving the precision of an initial document ranking wherein we utilize cluster information within a graph-based framework. The main idea is to perform re-ranking based on centrality within bipartite graphs of documents (on one side) and clusters (on the other side), on the premise that these are mutually reinforcing entities. Links between entities are created via consideration of language models induced from them. We find that our cluster-document graphs give rise to much better retrieval performance than previously proposed document-only graphs do. For example, authority-based re-ranking of documents via a HITS-style cluster-based approach outperforms a previously-proposed PageRank-inspired algorithm applied to solely-document graphs. Moreover, we also show that computing authority scores for clusters constitutes an effective method for identifying clusters containing a large percentage of relevant documents.

Categories and Subject Descriptors: H.3.3 [Information Search and Retrieval]: Retrieval models

General Terms: Algorithms, Experimentation

Keywords: bipartite graph, clusters, language modeling, HITS, hubs, authorities, high-accuracy retrieval, graph-based retrieval, structural re-ranking, cluster-based language models

1. INTRODUCTION

To improve the precision of retrieval output, especially within the very few (e.g., 5 or 10) highest-ranked documents that are returned, a number of researchers [36, 13, 16, 7, 22, 34, 25, 1, 18, 9] have considered a structural re-ranking strategy. The idea is to re-rank the top N documents that some initial search engine produces, where the re-ordering utilizes information about inter-document relationships within that set.
Promising results have been previously obtained by using document centrality within the initially retrieved list to perform structural re-ranking, on the premise that if the quality of this list is reasonable to begin with, then the documents that are most related to most of the documents on the list are likely to be the most relevant ones. In particular, in our prior work [18] we adapted PageRank [3] -- which, due to the success of Google, is surely the most well-established algorithm for defining and computing centrality within a directed graph -- to the task of re-ranking non-hyperlinked document sets.

The arguably most well-known alternative to PageRank is Kleinberg's HITS algorithm [16]. The major conceptual way in which HITS differs from PageRank is that it defines two different types of central items: each node is assigned both a hub and an authority score, as opposed to a single PageRank score. In the Web setting, in which HITS was originally proposed, good hubs correspond roughly to high-quality resource lists or collections of pointers, whereas good authorities correspond to the high-quality resources themselves; thus, distinguishing between two differing but interdependent types of Web pages is quite appropriate. Our previous study [18] applied HITS to non-Web documents. We found that its performance was comparable to or better than that of algorithms that do not involve structural re-ranking; however, HITS was not as effective as PageRank [18].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGIR'06, August 6-11, 2006, Seattle, Washington, USA. Copyright 2006 ACM 1-59593-369-7/06/0008 ...$5.00.
Do these results imply that PageRank is better than HITS for structural re-ranking of non-Web documents? Not necessarily, because there may exist graph-construction methods that are more suitable for HITS. Note that the only entities considered in our previous study were documents. If we could introduce entities distinct from documents but enjoying a mutually reinforcing relationship with them, then we might better satisfy the spirit of the hubs-versus-authorities distinction, and thus derive stronger results utilizing HITS.

A crucial insight of the present paper is that document clusters appear extremely well-suited to play this complementary role. The intuition is that: (a) given those clusters that are "most representative" of the user's information need, the documents within those clusters are likely to be relevant; and (b) the "most representative" clusters should be those that contain many relevant documents. This apparently circular reasoning is strongly reminiscent of the interrelated hubs and authorities concepts underlying HITS. Also, clusters have long been considered a promising source of information. The well-known cluster hypothesis [35] encapsulates the intuition that clusters can reveal groups of relevant documents; in practice, the potential utility of clustering for this purpose has been demonstrated for both the case wherein clusters were created in a query-independent fashion [14, 4] and the re-ranking setting [13, 22, 34].

In this paper, we show through an array of experiments that consideration of the mutual reinforcement of clusters and documents in determining centrality can lead to highly effective algorithms for re-ranking an initially retrieved list.
Specifically, our experimental results show that the centrality-induction methods that we previously studied solely in the context of document-only graphs [18] result in much better re-ranking performance if implemented over bipartite graphs of documents (on one side) and clusters (on the other side). For example, ranking documents by their "authoritativeness" as computed by HITS upon these cluster-document graphs yields better performance than that of a previously proposed PageRank implementation applied to document-only graphs. Interestingly, we also find that cluster authority scores can be used to identify clusters containing a large percentage of relevant documents.

in some sense. Then, v's authority score

    auth(v) = Σ_{u∈V} wt(u → v) · hub(u)    (1)

would be a natural measure of how "good" v is, since a node that is "strongly" pointed to by high-quality hubs (which, by definition, tend to point to "good" nodes) receives a high score. But where do we get the hub score for a given node u? A natural choice is to use the extent to which u "strongly" points to highly authoritative nodes:

    hub(u) = Σ_{v∈V} wt(u → v) · auth(v).    (2)

2. ALGORITHMS FOR RE-RANKING

Since we are focused on the structural re-ranking paradigm, our algorithms are applied not to the entire corpus, but to a subset D_init^{N,q} (henceforth D_init), defined as the top N documents retrieved in response to the query q by a given initial retrieval engine. Some of our algorithms also take into account a set Cl(D_init) of clusters of the documents in D_init. We use S_init to refer generically to whichever set of entities -- either D_init or D_init ∪ Cl(D_init) -- is used by a given algorithm.
The basic idea behind the algorithms we consider is to determine centrality within a relevance-flow graph, defined as a directed graph with non-negative weights on the edges in which

· the nodes are the elements of S_init, and
· the weight on an edge between nodes u and v is based on the strength of evidence for v's relevance that would follow from an assertion that u is relevant.

By construction, then, any measure of the centrality of s ∈ S_init should measure the accumulation of evidence for its relevance according to the set of interconnections among the entities in S_init. Such information can then optionally be subjected to additional processing, such as integration with information on each item's similarity to the query, to produce a final re-ranking of D_init.

Clearly, Equations 1 and 2 are mutually recursive. However, the iterative HITS algorithm¹ provably converges to (non-identically-zero, non-negative) score functions hub* and auth* that satisfy the above pair of equations. Figure 1 depicts the "iconic" case in which the input graph G is one-way bipartite, that is, V can be partitioned into non-empty sets VLeft and VRight such that only edges in VLeft × VRight can receive positive weight, and for each u ∈ VLeft, Σ_{v∈VRight} wt(u → v) > 0. It is the case that auth*(u) = 0 for every u ∈ VLeft and hub*(v) = 0 for every v ∈ VRight; in this sense, the left-hand nodes are "pure" hubs and the right-hand nodes are "pure" authorities.

Figure 1: A one-way bipartite graph. We only show positive-weight edges (omitting weight values). According to HITS, the left-hand nodes are (pure) hubs; the right-hand ones are (pure) authorities.

Note that in the end, we need to produce a single centrality score for each node n ∈ V. For experimental simplicity, we consider only two possibilities in this paper -- using auth*(n) as the final centrality score, or using hub*(n) instead -- although combining the hub and authority scores is also an interesting possibility.
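To make the mutual recursion of Equations 1 and 2 concrete, here is a minimal Python power-iteration sketch of edge-weighted HITS. The toy graph, the node names, and the per-round L2 re-normalization are our own illustrative assumptions, not details taken from the paper's experiments (HITS scores matter only up to scaling, so any consistent normalization serves).

```python
import math

def hits(nodes, wt, iters=50):
    # Iterate Equations 1 and 2, re-normalizing each round so the
    # scores stay bounded.
    auth = dict.fromkeys(nodes, 1.0)
    hub = dict.fromkeys(nodes, 1.0)
    for _ in range(iters):
        # auth(v) = sum_u wt(u -> v) * hub(u)   (Equation 1)
        auth = {v: sum(wt.get((u, v), 0.0) * hub[u] for u in nodes)
                for v in nodes}
        # hub(u) = sum_v wt(u -> v) * auth(v)   (Equation 2)
        hub = {u: sum(wt.get((u, v), 0.0) * auth[v] for v in nodes)
               for u in nodes}
        for scores in (auth, hub):
            norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return auth, hub

# Toy one-way bipartite graph: "clusters" c1, c2 point at "documents"
# d1, d2, d3. Since d2 is pointed at by both clusters, it should emerge
# as the strongest authority; the left-hand nodes act as pure hubs.
nodes = ["c1", "c2", "d1", "d2", "d3"]
wt = {("c1", "d1"): 1.0, ("c1", "d2"): 1.0,
      ("c2", "d2"): 2.0, ("c2", "d3"): 1.0}
auth, hub = hits(nodes, wt)
```

On this graph the left-hand nodes end up with zero authority and the right-hand nodes with zero hubness, matching the "pure hubs" / "pure authorities" behavior described above.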
2.2 Graph schemata: incorporating clusters

Recall that the fundamental operation in our structural re-ranking paradigm is to compute the centrality of entities (with)in a set S_init. One possibility is to define S_init as D_init, the documents in the initially retrieved set; we refer generically to any relevance-flow graph induced under this choice as a document-to-document graph. But note that for non-Web documents, it may not be obvious a priori what kinds of documents are hubs and what kinds are authorities. Alternatively, we can define S_init as D_init ∪ Cl(D_init), where Cl(D_init) consists of clusters of the documents in D_init. On a purely formal level, doing so allows us to map the hubs/authorities duality discussed above onto the documents/clusters duality, as follows. Recalling our discussion of the "iconic" case of one-way bipartite graphs G = ((VLeft, VRight), wt), we can create document-as-authority graphs simply by

¹Strictly speaking, the algorithm and proof of convergence as originally presented [16] need (trivial) modification to apply to edge-weighted graphs.

Conventions regarding graphs. The types of relevance-flow graphs we consider can all be represented as weighted directed graphs of the form (V, wt), where V is a finite nonempty set of nodes and wt : V × V → [0, ∞) is a non-negative edge-weight function. Note that thus our graphs technically have edges between all ordered pairs of nodes (self-loops included); however, edges with zero edge-weight are conceptually equivalent to missing edges. For clarity, we write wt(u → v) instead of wt(u, v).

2.1 Hubs, authorities, and the HITS algorithm

The HITS algorithm for computing centrality can be motivated as follows. Let G = (V, wt) be the input graph, and let v be a node in V.
First, suppose we somehow knew the hub score hub(u) of each node u ∈ V, where "hubness" is the extent to which the nodes that u points to are "good"

choosing VLeft = Cl(D_init) and VRight = D_init, so that necessarily clusters serve the role of (pure) hubs and documents serve the role of (pure) authorities. Contrariwise,² we can create document-as-hub graphs by setting VLeft = D_init and VRight = Cl(D_init).

But the advantages of incorporating cluster-based information are not just formal. The well-known cluster hypothesis [35] encapsulates the intuition that clusters can reveal groups of relevant documents; in practice, the potential utility of clustering for this purpose has been demonstrated a number of times, whether the clusters were created in a query-independent fashion [14, 4], or from the initially most-highly-ranked documents for some query [13, 22, 34] (i.e., in the re-ranking setting). Since central clusters are, supposedly, those that accrue the most evidence for relevance, documents that are strongly identified with such clusters should themselves be judged highly relevant.³ ⁴ But identifying such clusters is facilitated by knowledge of which documents are most likely to be relevant -- exactly the mutual reinforcement property that HITS was designed to leverage.

Equation 4 is recursive, but there are iterative algorithms that provably converge to the unique positive solution PR satisfying the sum-normalization constraint Σ_{v∈V} PR(v) = 1 [21]. Moreover, a (non-trivial) closed-form -- and quite easily computed -- solution exists for one-way bipartite graphs:

Theorem 1. If G = (V, wt) is one-way bipartite, then

    PR_bip(v) ≝ Σ_{u∈V: out(u)>0} wt(u → v) / out(u)    (5)

is an affine transformation (with respect to positive constants) of, and therefore equivalent for ranking purposes to, the unique positive sum-normalized solution to Equation 4. (Proof omitted due to space constraints.)
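Theorem 1 can be checked numerically. In the sketch below (our own illustration: the toy graph, the damping value λ = 0.85, and all function names are assumptions, and Equation 4 is iterated directly rather than solved analytically), the closed form ranks the right-hand "document" nodes exactly as the converged Equation-4 scores do, and the relation between the two score sets is affine with a positive slope.

```python
def smoothed_pagerank(nodes, wt, lam=0.85, iters=200):
    # Iterate Equation 4: teleportation plus a uniform correction for
    # no-outflow nodes (here, all the right-hand nodes).
    n = len(nodes)
    out = {u: sum(wt.get((u, v), 0.0) for v in nodes) for u in nodes}
    pr = dict.fromkeys(nodes, 1.0 / n)
    for _ in range(iters):
        pr = {v: (1 - lam) / n + lam * (
                 sum(wt[(u, v)] / out[u] * pr[u]
                     for u in nodes if out[u] > 0 and (u, v) in wt)
                 + sum(pr[u] / n for u in nodes if out[u] == 0))
              for v in nodes}
    return pr

def pr_bip(nodes, wt):
    # Theorem 1's closed form:
    # PR_bip(v) = sum over u with out(u) > 0 of wt(u -> v) / out(u).
    out = {u: sum(wt.get((u, v), 0.0) for v in nodes) for u in nodes}
    return {v: sum(wt[(u, v)] / out[u]
                   for u in nodes if out[u] > 0 and (u, v) in wt)
            for v in nodes}

# One-way bipartite toy graph: clusters on the left, documents on the right.
docs = ["d1", "d2", "d3"]
nodes = ["c1", "c2"] + docs
wt = {("c1", "d1"): 1.0, ("c1", "d2"): 3.0,
      ("c2", "d2"): 1.0, ("c2", "d3"): 1.0}
full = smoothed_pagerank(nodes, wt)
closed = pr_bip(nodes, wt)
```

Note how cheap the closed form is: one pass over the edges, with no iteration at all.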
Interestingly, this result shows that while one might have thought that clusters and documents would "compete" for PageRank score when placed within the same graph, in our document-as-authority and document-as-hub graphs this is not the case.

Earlier work [18] also considered scoring a node v by its influx, Σ_{u∈V} wt(u → v). This can be viewed as either a non-recursive version of Equation 3, or as an un-normalized analog of Equation 5.

2.3 Alternative scores: PageRank and influx

We will compare the results of using the HITS algorithm against those derived using PageRank instead. This is a natural comparison because PageRank is the most well-known centrality-induction algorithm utilized for ranking documents, and because in earlier work [18], PageRank performed quite well as a tool for structural re-ranking of non-Web documents, at least when applied to document-to-document graphs. One can think of PageRank as a version of HITS in which the hub/authority distinction has been collapsed. Thus, writing "PR" for both auth and hub, we conceptually have the (single) equation

    PR(v) = Σ_{u∈V} wt(u → v) · PR(u).    (3)

2.4 Algorithms based on centrality scores

Clearly, we can rank documents by their scores as computed by any of the functions introduced above. But when we operate on document-as-authority or document-as-hub graphs, centrality scores for the clusters are also produced. These can be used to derive alternative means for ranking documents. We follow Liu and Croft's approach [25]: first, rank the documents within (or most strongly associated to) each cluster according to the initial retrieval engine's scores; then, derive the final list by concatenating the within-cluster lists in order of decreasing cluster score, discarding repeats. Such an approach would be successful if cluster centrality is strongly correlated with the property of containing a large percentage of relevant documents.
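The concatenation procedure just described can be sketched as follows; the cluster contents, cluster scores, and initial-retrieval scores below are invented purely for illustration (note that the clusters overlap, so repeats must be discarded).

```python
def concat_rank(clusters, cluster_score, init_score):
    # Order clusters by decreasing centrality score; within each
    # cluster, order documents by the initial retrieval engine's score;
    # concatenate the per-cluster lists, skipping documents already
    # emitted (clusters may overlap).
    ranking, seen = [], set()
    for c in sorted(clusters, key=lambda c: cluster_score[c], reverse=True):
        for d in sorted(clusters[c], key=lambda d: init_score[d], reverse=True):
            if d not in seen:
                seen.add(d)
                ranking.append(d)
    return ranking

# Two overlapping clusters; c1 is the more "central" one.
clusters = {"c1": ["d1", "d2"], "c2": ["d2", "d3"]}
cluster_score = {"c1": 0.9, "c2": 0.5}
init_score = {"d1": 1.0, "d2": 3.0, "d3": 2.0}
final = concat_rank(clusters, cluster_score, init_score)
```

Here all of c1's documents precede the remaining documents of c2, regardless of their individual initial scores; this is exactly what makes the scheme sensitive to how well cluster centrality tracks relevant-document concentration.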
However, in practice, we incorporate Brin and Page's smoothing scheme [3] together with a correction for nodes with no positive-weight edges emanating from them [27, 21]:

    PR(v) = (1 − λ)·(1/|V|) + λ·[ Σ_{u∈V: out(u)>0} (wt(u → v)/out(u))·PR(u) + Σ_{u∈V: out(u)=0} (1/|V|)·PR(u) ],    (4)

where out(u) = Σ_{v∈V} wt(u → v), and λ ∈ (0, 1) is the damping factor.⁵

Ranking algorithms. Since we have two possible ranking paradigms, we adopt the following algorithm naming conventions. Names consist of a hyphen-separated prefix and suffix. The prefix ("doc" or "clust") indicates whether documents were ranked directly by their centrality scores, or indirectly through the concatenation process outlined above in which it is the clusters' centrality scores that were employed. The suffix ("Auth", "Hub", "PR", or "Influx") indicates which score function (auth*, hub*, PR (or PR_bip), or influx) was used to measure centrality. For a given re-ranking algorithm, we indicate the graph upon which it was run in brackets, e.g., "doc-Auth[G]".

²In practice, one can simultaneously compute the output of HITS for a given document-as-authority and document-as-hub graph pair by "overlaying" the two into a single graph and suitably modifying HITS's normalization scheme.
³We say "are strongly identified with", as opposed to "belong to", to allow for overlapping or probabilistic clusters. Indeed, the one-way bipartite graphs we construct are ill-suited to the HITS algorithm if document-to-cluster links are based on membership in disjoint clusters.
⁴This is, in some sense, a type of smoothing: a document might be missing some of the query terms (perhaps due to synonymy), but if it lies within a sector of "document space" containing many relevant documents, it could still be deemed highly relevant. Recent research pursues this smoothing idea at a deeper level [25, 17].
⁵Under the original "random surfer" model, the sum of the
transition probabilities out of "no outflow" nodes -- which are abundant in one-way bipartite graphs -- would be (1 − λ), not 1. Conceptually, the role of the second summation in Equation 4 is to set λ = 0 for these no-outflow nodes.

3. RELATED WORK

The potential merits of query-dependent clustering, that is, clustering the documents retrieved in response to a query, have long been recognized [30, 36, 23, 34, 25], especially in interactive retrieval settings [13, 22, 32]. However, automatically detecting clusters that contain many relevant documents remains a very hard task [36]. Section 5.2 presents results for detecting such clusters using centrality-based cluster ranking.

Recently, there has been a growing body of work on graph-based modeling for different language-processing tasks wherein links are induced by inter-entity textual similarities. Examples include document (re-)ranking [7, 24, 9, 18, 39], text summarization [11, 26], sentence retrieval [28], and document representation [10]. In contrast to our methods, links connect entities of the same type, and clusters of entities are not modeled within the graphs.

While ideas similar to ours by virtue of leveraging the mutual reinforcement of entities of different types, or using bipartite graphs of such entities for clustering (rather than using clusters), are abundant (e.g., [15, 8, 2]), we focus here on exploiting mutual reinforcement in ad hoc retrieval. Markov chains (with early stopping) over bipartite graphs of terms and documents were used for query expansion [20], but in contrast to our work, no stationary solution was sought. A similar "short chain" approach utilizing bipartite graphs of clusters and documents for ranking an entire corpus was recently proposed [19], thereby constituting the work most resembling ours.
However, in addition to not seeking a stationary solution, query-drift prevention mechanisms were required there to obtain good performance; in our re-ranking setting, we need not employ such mechanisms.

relevance flow is not symmetric [18]. Moreover, this function is somewhat insensitive to large length differences between the items in question [18], which is advantageous when both documents and clusters (which we treat as very long documents) are considered.

Previous work [18, 33] makes heavy use of the idea of nearest neighbors in language-model space. It is therefore convenient to introduce the notation Nbhd(x | m, R), pronounced "neighborhood", to denote the m items y within the "restriction set" R that have the highest values of rflow(x, y) (we break ties by item ID, assuming that these have been assigned to documents and clusters). Note that the neighborhood of x corresponds to what we previously termed the "top generators" of x [18].

Graphs used in experiments. For a given set D_init of initially retrieved documents and positive integer δ (an "out-degree" parameter), we consider the following three graphs. Each connects nodes u to the δ other nodes, drawn from some specified set, that u has the highest relevance flow to.

The document-to-document graph dd has vertex set D_init and weight function

    wt_dd(u → v) = rflow(u, v) if v ∈ Nbhd(u | δ, D_init − {u}), and 0 otherwise.

The document-as-authority graph cd has vertex set D_init ∪ Cl(D_init) and a weight function such that positive-weight edges go only from clusters to documents:

    wt_cd(u → v) = rflow(u, v) if u ∈ Cl(D_init) and v ∈ Nbhd(u | δ, D_init), and 0 otherwise.

The document-as-hub graph dc has vertex set D_init ∪ Cl(D_init) and a weight function such that positive-weight edges go only from documents to clusters:

    wt_dc(u → v) = rflow(u, v) if u ∈ D_init and v ∈ Nbhd(u | δ, Cl(D_init)), and 0 otherwise.

Since the latter two graphs are one-way bipartite, Theorem 1 applies to them.

4.
EVALUATION FRAMEWORK

Most aspects of the evaluation framework described below are adopted from our previous experiments with non-cluster-based structural re-ranking [18] so as to facilitate direct comparison. Section 4.1 of [18] provides a more detailed justification of the experimental design. The main conceptual changes⁶ here are: a slightly larger parameter search space for the "out-degree" parameter (called the "ancestry" parameter in [18]); and, of course, the incorporation of clusters.

4.1 Graph construction

Relevance flow based on language models (LMs). To estimate the degree to which one item, if considered relevant, can vouch for the relevance of another, we follow our previous work on document-based graphs [18] and utilize p_d^{[µ]}(·), the unigram Dirichlet-smoothed language model induced from a given document d (µ is the smoothing parameter) [38]. To adapt this estimation scheme to settings involving clusters, we derive the language model p_c^{[µ]}(·) for a cluster c by treating c as the (large) document formed by concatenating⁷ its constituent (or most strongly associated) documents [17, 25, 19]. The relevance-flow measure we use is essentially a directed similarity in language-model space:

    rflow(x, y) ≝ exp( −D( p_x^{[0]}(·) ‖ p_y^{[µ]}(·) ) ),    (6)

where D is the Kullback-Leibler divergence. The asymmetry of this measure corresponds nicely to the intuition that

Clustering Method. Clearly, our cluster-based graphs require the construction of clusters of the documents in D_init. Since this set is query-dependent, at least some of the clustering process must occur at retrieval time, mandating the use of extremely efficient algorithms [6, 37]. The approach we adopt is to use overlapping nearest-neighbor clusters, which have formed the basis of effective retrieval algorithms in other work [12, 17, 19, 33]: for each document d ∈ D_init, we have the cluster {d} ∪ Nbhd(d | k − 1, D_init − {d}), where k is the cluster-size parameter.
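As a rough illustration of Equation 6 and the clustering method, the sketch below builds Dirichlet-smoothed unigram models, the rflow measure, neighborhoods, and overlapping nearest-neighbor clusters over a three-"document" toy collection. All data, names, and the tiny µ value are our own assumptions (the paper sets µ = 2000 over TREC documents); the KL divergence is computed over the support of the unsmoothed model p_x^{[0]}.

```python
import math
from collections import Counter

def lm(tokens, corpus_model, mu):
    # Dirichlet-smoothed unigram LM [38]:
    # p(w) = (count(w) + mu * p_corpus(w)) / (len + mu).
    counts, total = Counter(tokens), len(tokens)
    return lambda w: (counts[w] + mu * corpus_model[w]) / (total + mu)

def rflow(x_tokens, y_model):
    # Equation 6: exp(-KL(p_x^[0] || p_y^[mu])), p_x^[0] being the
    # unsmoothed (maximum-likelihood) model of x.
    counts, total = Counter(x_tokens), len(x_tokens)
    kl = sum((c / total) * math.log((c / total) / y_model(w))
             for w, c in counts.items())
    return math.exp(-kl)

def nbhd(x, m, restriction, flow):
    # Nbhd(x | m, R): the m items in R with the highest rflow from x,
    # ties broken by item ID.
    return sorted(restriction, key=lambda y: (-flow[(x, y)], y))[:m]

# Tiny toy collection; the collection-wide token distribution plays the
# role of the corpus model used for smoothing.
docs = {"d1": "a a b".split(), "d2": "a b b".split(), "d3": "c c c".split()}
all_tokens = [w for t in docs.values() for w in t]
corpus_model = {w: c / len(all_tokens) for w, c in Counter(all_tokens).items()}
mu = 2.0
models = {d: lm(t, corpus_model, mu) for d, t in docs.items()}
flow = {(x, y): rflow(docs[x], models[y])
        for x in docs for y in docs if x != y}

# Overlapping nearest-neighbor cluster for each document:
# {d} union Nbhd(d | k-1, D_init - {d}), here with k = 2.
clusters = {d: [d] + nbhd(d, 1, [e for e in docs if e != d], flow)
            for d in docs}
```

Because rflow(x, y) is not rflow(y, x) in general, the induced graphs are directed, which is exactly what the relevance-flow interpretation calls for.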
4.2 Experimental Setup

We conducted our experiments on three TREC datasets:

    corpus   # of docs   queries          disk(s)
    AP       242,918     51-64, 66-150    1-3
    TREC8    528,155     401-450          4-5
    WSJ      173,252     151-200          1-2

⁶Some of the PageRank results appearing in our previous paper [18] accidentally reflect experiments utilizing a sub-optimal choice of D_init. For citation purposes, the numbers reported in the current paper should be used.
⁷Concatenation order is irrelevant for unigram LMs.

                       AP                         TREC8                      WSJ
                       prec@5  prec@10  MRR      prec@5  prec@10  MRR       prec@5  prec@10  MRR
    doc-Auth[dd]       .509    .486     .638     .440    .424     .648      .504    .464     .638
    doc-PageRank[dd]   .519    .480     .632     .524    .446     .666      .536    .486     .699
    doc-Auth[cd]       .541    .501p    .669p    .544a   .452     .674      .564a   .514a    .746a

Table 1: Main comparison: HITS or PageRank on document-only graphs versus HITS on cluster-to-document graphs. Bold: best results per column. Symbols "p" and "a": doc-Auth[cd] result differs significantly from that of doc-PageRank[dd] or doc-Auth[dd], respectively.

We applied basic tokenization and Porter stemming via the Lemur toolkit (www.lemurproject.org), which we also used for language-model induction. Topic titles served as queries.

In many retrieval situations of interest, ensuring that the top few documents retrieved (a.k.a. "the first page of results") tend to be relevant is much more important than ensuring that we assign relatively high ranks to the entire set of relevant documents in aggregate [31]. Hence, rather than use mean average precision (MAP) as an evaluation metric, we apply metrics more appropriate to the structural re-ranking task: precision at the top 5 and 10 documents (henceforth prec@5 and prec@10, respectively) and the mean reciprocal rank (MRR) of the first relevant document [31]. All performance numbers are averaged over the set of queries for a given corpus.
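For reference, the three metrics can be computed for a single query as sketched below (per the above, reported numbers average these over all of a corpus's queries); the ranking and relevance judgments are invented for illustration.

```python
def prec_at(ranking, relevant, n):
    # prec@n: fraction of the top n documents that are relevant.
    return sum(1 for d in ranking[:n] if d in relevant) / n

def mrr(ranking, relevant):
    # Reciprocal rank of the highest-ranked relevant document
    # (0 if no relevant document is retrieved).
    for i, d in enumerate(ranking, start=1):
        if d in relevant:
            return 1.0 / i
    return 0.0

ranking = ["d3", "d7", "d1", "d4", "d9", "d2"]
relevant = {"d7", "d4", "d2"}
```

These metrics only look at the top of the list, which is why optimizing them (rather than MAP) matches the structural re-ranking goal of a relevant "first page of results".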
The natural baseline for the work described here is the standard language-model-based retrieval approach [29, 5], since it is an effective paradigm that makes no explicit use of inter-document relationships. Specifically, for a given evaluation metric e, the corresponding optimized baseline is the ranking on documents produced by p_d^{[µ(e)]}(q), where µ(e) is the value of the Dirichlet smoothing parameter that results in the best retrieval performance as measured by e. A ranking method might assign different items the same score; we break such ties by item ID. Alternatively, the scores used to determine D_init can be utilized, if available.

as possible; but the advantage of our policy is that we can see whether optimization with respect to a fixed criterion yields good results no matter how "goodness" is measured.

Parameter values were selected from the following sets. The graph "out-degree" δ: {2, 4, 9, 19, 29, 39, 49}. The cluster size k: {2, 5, 10, 20, 30}. The PageRank damping factor λ: {0.05, 0.1, ..., 0.9, 0.95}.

5. EXPERIMENTAL RESULTS

In what follows, when we say that results or the difference between results are "significant", we mean according to the two-sided Wilcoxon test at a confidence level of 95%.

5.1 Re-Ranking by Document Centrality

Main result. We first consider our main question: can we substantially boost the effectiveness of HITS by applying it to cluster-to-document graphs, which we have argued are more suitable for it than the document-to-document graphs we constructed in our previous work [18]? The answer, as shown in Table 1, is clearly "yes": we see that moving to cluster-to-document graphs results in substantial improvement for HITS, and indeed boosts its results over those for PageRank on document-to-document graphs.

Full suite of comparisons.
We now turn to Figure 2, which gives the results for the re-ranking algorithms doc-Influx, doc-PageRank and doc-Auth as applied to either the document-based graph dd (as in [18]) or the cluster-document graph cd. (Discussion of doc-Hub is deferred to Section 5.3.)

To focus our discussion, it is useful to first point out that in almost all of our nine evaluation settings (3 corpora × 3 evaluation measures), all three of the re-ranking algorithms perform better when applied to cd graphs than to dd graphs, as the number of dark bars in Figure 2 indicates. Since it is thus clearly useful to incorporate cluster-based information, we will now mainly concentrate on cd-based algorithms.

The results for prec@5, the metric for which the re-ranking algorithms' parameters were optimized, show that all cd-based algorithms outperform the prec@5-optimized baseline -- significantly so for the AP corpus -- even though applied to a sub-optimally-ranked initial set. (We hasten to point out that while the initial ranking is always inferior to the corresponding optimized baseline, the differences are never significant.) In contrast, the use of dd graphs never leads to significantly superior prec@5 results. We also observe in Figure 2 that the doc-Auth[cd] algorithm is always either the best of the cd-based algorithms or clearly competitive with the best. Furthermore, pairwise comparison of it to each of the doc-Influx[cd]

Parameter selection for graph-based methods. There are two motivations underlying our approach to choosing values for our algorithms' parameters [18]. First, we hope to show that structural re-ranking can provide better results than the optimized baselines even when initialized with a sub-optimal (yet reasonable) ranking. Hence, let the initial ranking be the document ordering induced on the entire corpus by p_d^{[µ_1000]}(q), where µ_1000 is the smoothing-parameter value optimizing the average non-interpolated precision of the top 1000 documents.
We set D_init to the top 50 documents in the initial ranking. Second, we wish to show that good results can be achieved without a great deal of parameter tuning. Therefore, we did not tune the smoothing parameter for any of the language models used to determine graph edge-weights, but rather simply set µ = 2000 when smoothing was required, following a prior suggestion [38]. Also, the other free parameters' values were chosen so as to optimize prec@5, regardless of the evaluation metric under consideration.⁸ As a consequence, our prec@10 and MRR results are presumably not as high

⁸If two different parameter settings yield the same prec@5, we choose the setting minimizing prec@10 so as to provide a conservative estimate of expected performance. Similarly, if we have ties for both prec@5 and prec@10, we choose the setting minimizing MRR.

of ranking documents, in the sense of identifying clusters that contain a high percentage of relevant documents. Note that the problem of automatically identifying such clusters in the re-ranking setting has been acknowledged to be a hard task for some time [36]. Nevertheless, as stated in Section 2.4, we experimented with Liu and Croft's general clusters-for-selection approach [25]: rank the clusters, then rank the documents within each cluster by p_d^{[µ]}(q). Our baseline algorithm, clust-p_c^{[µ]}(q), adopts Liu and Croft's specific proposal of the CQL algorithm -- except that we employ overlapping rather than hard clusters -- wherein clusters are ranked by the query likelihood p_c^{[µ]}(q) instead of one of our centrality scores. Table 2 (which may appear on the next page) presents the performance results.
Our first observation is that the clust-Influx[dc] and clust-Auth[dc] algorithms are superior in a majority of the relevant comparisons to the initial ranking, the optimized baselines, and the clust-p_c^{[µ]}(q) algorithm, where the performance differences with the latter sometimes achieve significance. However, the performance of the document-centrality-based algorithm doc-Auth[cd] is better in a majority of the evaluation settings than that of any of the cluster-centrality-based algorithms. On the other hand, it is possible that the latter methods could be improved by a better technique for within-cluster ranking.

To compare the effectiveness of clust-Influx[dc] and clust-Auth[dc] to that of clust-p_c^{[µ]}(q) in detecting clusters with a high percentage of relevant documents -- thereby neutralizing within-cluster ranking effects -- we present in Table 3 the percent of documents in the highest-ranked cluster that are relevant. (Cluster size (k) was fixed to either 5 or 10 and out-degree (δ) was chosen to optimize the above percentage.) Indeed, these results clearly show that our best cluster-based algorithms are much better than clust-p_c^{[µ]}(q) in detecting clusters containing a high percentage of relevant documents, in most cases to a significant degree.

Figure 2 (three panels -- prec@5, prec@10, and MRR -- each showing AP, TREC8, and WSJ): All re-ranking algorithms, as applied to either dd graphs or cd graphs.

and doc-PageRank[cd] algorithms favors the HITS-style doc-Auth[cd] algorithm in a majority of the evaluation settings.
We also experimented with a few alternate graph-construction methods, such as sum-normalizing the weights of edges out of nodes, and found that the doc-Auth[cd] algorithm remained superior to doc-Influx[cd] and doc-PageRank[cd]. We omit these results due to space constraints. All in all, these findings lead us to believe that not only is it useful to incorporate information from clusters, but it can be more effective to do so in a way reflecting the mutuallyreinforcing nature of clusters and documents, as the HITS algorithm does. 5.2 Re-Ranking by Cluster Centrality We now consider the alternative, mentioned in Section 2.4, of using the centrality scores for clusters as an indirect means in it ra op tb nk in g do cas el in e baselines no clusters (d<->d) do cIn flu x do cAu th k R an ge Pa in it ra op tb nk in g do cas el in e do cIn flu x do cAu th k R an ge Pa with clusters (c->d): GAIN with clusters (c->d): LOSS in it ra op tb nk in g do cas el in e do cIn flu x do cAu th k R an ge Pa AP k =5 39.2 48.7 c 49.5 c k =10 38.8 47.6 c 47.2 c TREC8 k =5 k =10 39.6 48.0 50.8 40.6 43.8 46.6 k =5 WSJ k =10 37.0 48.0 c 49.0 c c 44.0 51.2 c 53.6 c Table 3: Average relevant-document percentage within the top-ranked cluster. k: cluster size. Bold: best results per column. c: result differs signifi[µ] cantly from that of clust-pc (q ), used in [25]. 5.3 Further Analysis Authorities versus hubs. So far, we have only considered utilizing the authority scores that the HITS algorithm produces. The chart below shows the effect of ranking entities by hub scores instead. Specifically, the "documents?" column compares doc-Auth[cd] (i.e., ranking documents by authoritativeness) to doc-Hub[dc] (i.e., ranking documents by hubness); similarly, the "clusters?" column compares clust-Auth[dc] to clust-Hub[cd]. Each entry depicts, in descending order of performance (except for the one indicated tie) as one moves left to right, those central- 88 init. ranking opt. 
baselines [µ] clust-pc (q ) clust-Influx[dc] clust-PageRank[dc] clust-Auth[dc] prec@5 .457 .465 .448 .511 c .493 .533 ioc AP prec@10 .432 .439 .418 .479 c .475 c .478 c MRR .596 .635 .549 io .619 .595 .651 prec@5 .500 .512 .500 .524 .496 .532 TREC8 prec@10 .456 .464 .432 .478 .444 .460 MRR .691 .696 .723 .681 .683 .714 prec@5 .536 .560 .504 o .568 .528 .552 c WSJ prec@10 .484 .494 .454 io .512 c .490 c .478 MRR .748 .772 .680 .760 .736 .757 c Table 2: Cluster-based re-ranking. Bold: best results per column. Symbols i, o, c: results differ significantly [µ] from the initial ranking, optimized baseline, or (for the re-ranking algorithms) clust-p c (q ) [25], respectively. ity scoring functions that lead to an improvement over the initial ranking: A stands for "authority" and H for "hub". Cases in which the improvement is significant are marked with a `*'. uation metrics), doc-Auth[dd] achieved better results using w t[] edge weights. However, we cannot discount the possibility that the performance differences might be due simply to the inclusion of the extra interpolation-parameter . Moreover, in all but one case, the improved results were still below those for doc-PageRank[dd] (and always lagged When do we improve the initial ranking behind those of doc-Auth[cd]). by measuring the centrality of: Interestingly, the situation is qualitatively different if we documents? clusters? consider cd graphs instead. In brief, we applied a smoothprec @5 A H A H ing scheme analogous to that described above, but only to AP prec @10 A H AH edges leading from a left-hand node (cluster) to a right-hand MRR AH A node (document)9 ; we thus preserved the one-way biparprec @5 AH AH tite structure. 
Only in two of the nine evaluation settings TREC8 prec @10 HA did this change cause an increase in performance of docH H A MRR Auth[cd] over the results attained under the original edgeprec @5 AH AH (tie) weighting scheme, despite the fact that the re-weighting inWSJ prec @10 AH H volves an extra free parameter. Thus, while we have alMRR HA ready demonstrated in previous sections of this paper that We see that in many cases, hub-based re-ranking does information about document-cluster similarity relationships yield better performance than the initial ranking. But authority- is very valuable, the results just mentioned suggest that such based re-ranking appears to be an even better choice overall. information is more useful in "raw" form. son of doc-Auth[dd] against doc-PageRank[dd]. As the notation suggests, this corresponds to running HITS and PageRank on the same graph, dd. But an alternative interpretation [18] is that non-smoothed (or no-random-jump) PageRank, as expressed by Equation (3), is applied to a different version of dd wherein the original edge weights w t(u v ) have been smoothed as follows: w t[] (u v ) = def HITS on PageRank-style graphs. Consider our compari- Re-anchoring to the query. In previous work, we showed that PageRank centrality scores induced over documentbased graphs can be used as a multiplicative weight on document query-likelihood terms, the intent being to cope with cases in which centrality in Dinit and relevance are not strongly correlated [18]. Indeed, employing this technique on the AP, TREC8, and WSJ corpora, prec@5 increases from .519, .524 and .536, to .531, .56 and .572 respectively. The same modification could be applied to the cd-based algorithms, although it is not particularly well-motivated in the HITS case. 
While PageRank scores correspond to a stationary distribution that could be loosely interpreted as a prior [18], in which case multiplicative combination with query likelihood is sensible, it is not usual to assign a probabilistic interpretation to hub or authority scores. Nonetheless, for the sake of comparison completeness, we applied this idea to the doc-Auth[cd] algorithm, yielding the following performance changes: from .541, .544, and .564 to .537, .572 and .572 respectively. These results are still as good as -- and for two corpora better than -- those for PageRank as a multiplicative weight on query likelihood. Thus, it may be the case that centrality scores induced over a document-based graph are more effective as a multiplicative bias on query-likelihood than as direct representations of rel9 In the one-way bipartite case, the "|V |" in Equation (7) must be changed to the number of right-hand nodes. 1- w t(u v ) + |V | out(u) (7) (we ignore nodes with no positive-weight out-edges to simplify discussion, and omit the dd superscripts for clarity). How does HITS perform on document-to-document graphs that are "truly equivalent", in the sense of employing the above edge-weighting regime, to those that PageRank is applied to? One reason this is an interesting question is that HITS assigns scores of zero to nodes that are not in the graph's largest connected component (with respect to positive-weight edges, considered to be bi-directional). Notice that the original graph may have several connected components, whereas utilizing w t[] ensures that each node has a positive-weight directed edge to every other node. Additionally, the re-weighted version of HITS has provable stability properties [27]. 
We found that in nearly all of our evaluation settings for document-to-document graphs (three corpora × three eval- 89 evance in Dinit (see also [18]); but, modulo the caveat above, it seems that when centrality is induced over cluster-based one-way bipartite graphs, the correlation with relevance is much stronger, and hence this kind of centrality serves as a better "bias" on query-likelihood. 6. CONCLUSION We have shown that leveraging the mutually reinforcing relationship between clusters and documents to determine centrality is very beneficial not only for directly finding relevant documents in an initially retrieved list, but also for finding clusters of documents from this list that contain a high number of relevant documents. Specifically, we demonstrated the superiority of clusterdocument bipartite graphs to document-only graphs as the input to centrality-induction algorithms. Our method for finding "authoritative" documents (or clusters) using HITS over these bipartite graphs results in state-of-the-art performance for document (and cluster) re-ranking. Acknowledgments We thank Eric Breck, Claire Cardie, Oren Etzioni, Jon Kleinberg, Art Munson, Filip Radlinski, Ves Stoyanov, Justin Wick and the anonymous reviewers for valuable comments. This paper is based upon work supported in part by the National Science Foundation under grant no. IIS-0329064 and an Alfred P. Sloan Research Fellowship. Any opinions, findings, and conclusions or recommendations expressed are those of the authors and do not necessarily reflect the views or official policies, either expressed or implied, of any sponsoring institutions, the U.S. government, or any other entity. 7. REFERENCES [1] J. Balinski and C. Danilowicz. Re-ranking metho d based on ´ inter-do cument distances. Information Processing and Management, 41(4):759­775, 2005. [2] D. Beeferman and A. L. Berger. Agglomerative clustering of a search engine query log. In Proceedings of KDD, pages 407­416, 2000. [3] S. Brin and L. 
Page. The anatomy of a large-scale hypertextual web search engine. In Proceedings of the 7th International World Wide Web Conference, pages 107–117, 1998.
[4] W. B. Croft. A model of cluster searching based on classification. Information Systems, 5:189–195, 1980.
[5] W. B. Croft and J. Lafferty, editors. Language Modeling for Information Retrieval. Number 13 in Information Retrieval Book Series. Kluwer, 2003.
[6] D. R. Cutting, D. R. Karger, J. O. Pedersen, and J. W. Tukey. Scatter/Gather: A cluster-based approach to browsing large document collections. In Proceedings of the 15th Annual International SIGIR, pages 318–329, Denmark, June 1992.
[7] C. Danilowicz and J. Balinski. Document ranking based upon Markov chains. Information Processing and Management, 41(4):759–775, 2000.
[8] I. Dhillon. Co-clustering documents and words using bipartite spectral graph partitioning. In Proceedings of the Seventh ACM SIGKDD Conference, pages 269–274, 2001.
[9] F. Diaz. Regularizing ad hoc retrieval scores. In Proceedings of the Fourteenth International Conference on Information and Knowledge Management (CIKM), pages 672–679, 2005.
[10] G. Erkan. Language model based document clustering using random walks. In Proceedings of HLT/NAACL, 2006.
[11] G. Erkan and D. R. Radev. LexRank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research, 22:457–479, 2004.
[12] A. Griffiths, H. C. Luckhurst, and P. Willett. Using interdocument similarity information in document retrieval systems. Journal of the American Society for Information Science (JASIS), 37(1):3–11, 1986. Reprinted in Karen Sparck Jones and Peter Willett, eds., Readings in Information Retrieval, Morgan Kaufmann, pp. 365–373, 1997.
[13] M. A. Hearst and J. O. Pedersen. Reexamining the cluster hypothesis: Scatter/Gather on retrieval results. In Proceedings of SIGIR, 1996.
[14] N. Jardine and C. J. van Rijsbergen. The use of hierarchic clustering in information retrieval.
Information Storage and Retrieval, 7(5):217–240, 1971.
[15] Y. Karov and S. Edelman. Similarity-based word sense disambiguation. Computational Linguistics, 24(1):41–59, 1998.
[16] J. Kleinberg. Authoritative sources in a hyperlinked environment. In Proceedings of the 9th ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 668–677, 1998. Extended version in Journal of the ACM, 46:604–632, 1999.
[17] O. Kurland and L. Lee. Corpus structure, language models, and ad hoc information retrieval. In Proceedings of SIGIR, pages 194–201, 2004.
[18] O. Kurland and L. Lee. PageRank without hyperlinks: Structural re-ranking using links induced by language models. In Proceedings of SIGIR, pages 306–313, 2005.
[19] O. Kurland, L. Lee, and C. Domshlak. Better than the real thing? Iterative pseudo-query processing using cluster-based language models. In Proceedings of SIGIR, pages 19–26, 2005.
[20] J. D. Lafferty and C. Zhai. Document language models, query models, and risk minimization for information retrieval. In Proceedings of SIGIR, pages 111–119, 2001.
[21] A. N. Langville and C. D. Meyer. Deeper inside PageRank. Internet Mathematics, 2005.
[22] A. Leuski. Evaluating document clustering for interactive information retrieval. In Proceedings of the Tenth International Conference on Information and Knowledge Management (CIKM), pages 33–40, 2001.
[23] A. Leuski and J. Allan. Evaluating a visual navigation system for a digital library. In Proceedings of the Second European Conference on Research and Advanced Technology for Digital Libraries (ECDL), pages 535–554, 1998.
[24] G.-A. Levow and I. Matveeva. University of Chicago at CLEF2004: Cross-language text and spoken document retrieval. In Proceedings of CLEF, pages 170–179, 2004.
[25] X. Liu and W. B. Croft. Cluster-based retrieval using language models. In Proceedings of SIGIR, pages 186–193, 2004.
[26] R. Mihalcea and P. Tarau. TextRank: Bringing order into texts.
In Proceedings of EMNLP, pages 404–411, 2004. Poster.
[27] A. Y. Ng, A. X. Zheng, and M. I. Jordan. Stable algorithms for link analysis. In Proceedings of SIGIR, pages 258–266, 2001.
[28] J. Otterbacher, G. Erkan, and D. R. Radev. Using random walks for question-focused sentence retrieval. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP), pages 915–922, 2005.
[29] J. M. Ponte and W. B. Croft. A language modeling approach to information retrieval. In Proceedings of SIGIR, pages 275–281, 1998.
[30] S. E. Preece. Clustering as an output option. In Proceedings of the American Society for Information Science, pages 189–190, 1973.
[31] C. Shah and W. B. Croft. Evaluating high accuracy retrieval techniques. In Proceedings of SIGIR, pages 2–9, 2004.
[32] X. Shen and C. Zhai. Active feedback in ad hoc information retrieval. In Proceedings of SIGIR, pages 59–66, 2005.
[33] T. Tao, X. Wang, Q. Mei, and C. Zhai. Language model information retrieval with document expansion. In Proceedings of HLT/NAACL, 2006.
[34] A. Tombros, R. Villa, and C. van Rijsbergen. The effectiveness of query-specific hierarchic clustering in information retrieval. Information Processing and Management, 38(4):559–582, 2002.
[35] C. J. van Rijsbergen. Information Retrieval. Butterworths, second edition, 1979.
[36] P. Willett. Query specific automatic document classification. International Forum on Information and Documentation, 10(2):28–32, 1985.
[37] O. Zamir and O. Etzioni. Web document clustering: a feasibility demonstration. In Proceedings of SIGIR, pages 46–54, 1998.
[38] C. Zhai and J. D. Lafferty. A study of smoothing methods for language models applied to ad hoc information retrieval. In Proceedings of SIGIR, pages 334–342, 2001.
[39] B. Zhang, H. Li, Y. Liu, L. Ji, W. Xi, W. Fan, Z. Chen, and W.-Y. Ma. Improving web search results using affinity graph. In Proceedings of SIGIR, pages 504–511, 2005.
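For concreteness, the two graph computations discussed in Section 5 can be sketched in a few lines of code: the HITS-style mutual reinforcement of hub (cluster) and authority (document) scores on a one-way bipartite cluster-to-document graph, and the PageRank-style edge re-weighting of Equation (7). This is a minimal illustrative sketch, not the authors' implementation: the function and variable names are ours, and the toy edge weights stand in for the language-model-induced weights the paper actually uses.

```python
# Illustrative sketch (not the authors' code) of the two graph computations
# discussed in Section 5. Toy edge weights stand in for the language-model-
# induced weights; names are ours.

def smooth_edges(edges, nodes, lam=0.85):
    """PageRank-style re-weighting of Equation (7):
    wt_lam(u -> v) = (1 - lam)/|V| + lam * wt(u -> v) / out(u),
    where out(u) is the total weight on u's out-edges. `edges` maps
    (u, v) pairs to positive weights; `nodes` is the set of targets
    (for one-way bipartite graphs, the right-hand nodes, per footnote 9)."""
    out = {}
    for (u, _), w in edges.items():
        out[u] = out.get(u, 0.0) + w
    base = (1.0 - lam) / len(nodes)
    return {(u, v): base + lam * edges.get((u, v), 0.0) / out[u]
            for u in out for v in nodes}

def bipartite_hits(edges, n_iters=50):
    """Mutually reinforcing scores on a cluster -> document graph:
    clusters act as hubs, documents as authorities."""
    clusters = {c for c, _ in edges}
    docs = {d for _, d in edges}
    hub = {c: 1.0 for c in clusters}
    auth = {d: 1.0 for d in docs}
    for _ in range(n_iters):
        # authority of a document: weighted sum of hub scores of the
        # clusters that point to it
        auth = {d: sum(w * hub[c] for (c, d2), w in edges.items() if d2 == d)
                for d in docs}
        # hub score of a cluster: weighted sum of authority scores of the
        # documents it points to
        hub = {c: sum(w * auth[d] for (c2, d), w in edges.items() if c2 == c)
               for c in clusters}
        # L1-normalize each side to keep scores bounded
        za, zh = sum(auth.values()), sum(hub.values())
        auth = {d: s / za for d, s in auth.items()}
        hub = {c: s / zh for c, s in hub.items()}
    return hub, auth

# Toy cd graph: two overlapping clusters over three documents.
edges = {("c1", "d1"): 0.6, ("c1", "d2"): 0.4,
         ("c2", "d2"): 0.5, ("c2", "d3"): 0.5}
hub, auth = bipartite_hits(edges)
doc_auth_ranking = sorted(auth, key=auth.get, reverse=True)
```

In the toy example, document d2 is linked from both clusters and therefore receives the highest authority score, i.e., it would be ranked first by a doc-Auth-style method; sorting clusters by `hub` instead corresponds to the cluster-centrality variant.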