SIGIR 2007 Proceedings Poster Using Similarity Links as Shortcuts to Relevant Web Pages Center for Intelligent Information Retrieval Depar tment of Computer Science University of Massachusetts Amherst {smucker, allan}@cs.umass.edu Mark D. Smucker and James Allan ABSTRACT Successful navigation from a relevant web page to other relevant pages dep ends on the page linking to other relevant pages. We measured the distance to travel from relevant page to relevant page and found a bimodal distribution of distances p eaking at 4 and 15 hops. In an attempt to make it easier to navigate among relevant pages, we added content similarity links to pages. With these additional links, significantly more relevant documents were close to each other. A browser plug-in or other tool that provides links to pages similar to a given page should increase the ability of web users to find relevant pages via navigation. Categories and Sub ject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval General Terms: Exp erimentation Keywords: Web, shortest paths, cluster hyp othesis, relevance feedback, content and link similarity, web graph 2. EXPERIMENTS We ran our exp eriments using the wt10g TREC web collection. Sob oroff [5] has shown the wt10g collection to have structural characteristics similar to the web. We used the TREC 2001 web ad-hoc topics numb ered 501-550. We compared the web graph with two augmented versions of the graph. For each topic, we augmented the graph by adding 10 out-links to each relevant document. In the first case, we added links to the 10 most content similar documents. This case corresp onds to our envisioned browser plug-in that provides a list of the 10 web pages most similar to the current page. In the second case, we added 10 random links. This case allows us to make sure that the mere addition of links does not make relevant documents closer to each other on the web graph. For each topic, we measured the shortest path distance from each relevant document to all other documents. Traversing a link or one "hop" is a distance of 1. We did not add any out-links to the non-relevant documents. We think that searchers would only utilize a feature providing these similarity links when they are looking for pages similar to a current relevant web page. In effect, we say the links do not exist on non-relevant pages b ecause we do not b elieve that users will utilize the feature on nonrelevant pages. In addition, it is unlikely that non-relevant documents can b e used to find relevant documents based on content similarity. Because we measure shortest paths, the distance from a relevant document to another relevant document is an upp er b ound on path length. Augmenting non-relevant documents with additional links could only have shortened the path lengths. On the other hand, if we had augmented all documents with additional out-links, the non-relevant documents would b e closer than we rep ort. We computed overall averages by first averaging the measurements for a topic's relevant documents and then averaging all the topics. We constructed the web graph using the wt10g out links file. To compute the document to document content similarity, we created a maximum likelihood estimated model of each document. We truncated each model to consist of only the document's 50 most probable terms. Using this model, we measure the similarity of the other documents using the KL-divergence. We used Dirichlet prior smoothing and set its parameter to 1500. We stemmed using the Krovetz stemmer and used an in-house list of 418 stop words. We used the Lemur toolkit for our exp eriments. 1. INTRODUCTION When a user finds a web page that the user considers relevant and wants to find other relevant web pages, a seemingly reasonable approach the user can take is to follow the page's links to other pages. The degree to which this approach is reasonable dep ends on the extent to which the well-known cluster hyp othesis is true on the web when the similarity measure is the distance to navigate from one document to another using hyp erlinks. We first investigated the extent to which the cluster hyp othesis is true on the web graph, and then we attempted to improve the navigability of the web with the automatic addition of links to similar web documents. If our automatically created links make relevant documents easier to find, such a capability could b e packaged into some sort of browser plug-in that allows the user to request a list of web pages similar to the current web page. We found that: · Relevant documents are either within distance 5 of another relevant page or are as likely to b e reached at greater distances as non-relevant documents. · The automatic addition of content similarity hyp erlinks can significantly increase the numb er of relevant documents reachable from a given relevant document. Copyright is held by the author/owner(s). SIGIR'07, July 23­27, 2007, Amsterdam, The Netherlands. ACM 978-1-59593-597-7/07/0007. 863 SIGIR 2007 Proceedings Poster Average Number of Relevant Documents Web plus 10 Similarity Links Web plus 10 Random Links Web Only documents closer but do not help at short distances. Thus, the gains obtained by similarity linking app ear to result from relevant documents b eing more similar to each other than simply from increasing the graph connectivity. 3 4 4. RELATED WORK Allan [1] studied the automatic construction of hyp ertext with a focus on the typing of links while here we make no attempt to address the issue of helping a user choose among links. We have previously studied using a single document as query to find similar pages [4], but here our emphasis has b een on understanding the distribution of relevant and non-relevant documents on the web graph with and without the addition of similarity links. Chakrabarti et al. [2] studied the rate at which topics dissipate on the web graph, while we looked at the related ­ but different ­ notion of distance b etween relevant pages. Our results are in line with Menczer's prediction that one would b e able to infer relevance of a page at a maximum of around 4 to 5 hops [3]. Vassilvitskii and Brill [6] used distance on the web graph to p erform a reranking of search results given that relevant documents link to other relevant documents. For each top ranked search result, they p erformed a limited breadth first search and found that searching to a distance of 4 resulted in the b est p erformance. Our results explain their finding by showing that relevant documents are found within a distance of 5 or are as likely to b e found as non-relevant documents. 0 1 2 0 5 10 15 20 25 30 35 Distance from Relevant Document 80 0 00 Average Number of Non-Relevant Documents Web plus 10 Random Links Web plus 10 Similarity Links Web Only 0 20000 40000 6000 0 0 5 10 15 20 25 30 35 Distance from Relevant Document 5. CONCLUSION We found a bimodal distribution for the distance of relevant documents to each other on the web graph. Relevant documents are within a distance of 5 of each other or as likely to b e reached as non-relevant documents. The automatic addition of 10 content similarity links brings significantly more relevant documents close to each other. Augmenting hyp ertext with similarity links should aid the searcher attempting to navigate from a relevant document to other relevant documents. Figure 1: The distance of relevant and non-relevant documents from relevant documents. 3. RESULTS AND DISCUSSION Figure 1 shows the distribution of relevant and non-relevant documents as a function of their web graph distance from relevant documents. The "Web Only" distribution of relevant documents shows a clear bimodal shap e p eaking at distances of 4 and 15. The distribution of non-relevant documents is unimodal p eaking around 17. The relevant documents reached at distances greater than 6 are most likely reached simply as a result of the interconnectedness of the web and are no easier to reach than non-relevant documents. The p eak of relevant documents at distances less than 6 demonstrate that the cluster hyp othesis is true on the web graph at least for some relevant documents. The addition of 10 content similar links brings a significant numb er of relevant documents closer to the average relevant document. A similar p eaking at a distance of 4 shows that the relevant documents reached using the similarity links are likely indep endent of the ones b eing found using the existing links. With existing web links, on average 0.19 relevant documents are within distance 1 from a relevant document. With the addition of 10 content similarity links, an average of 2.26 relevant documents are distance 1 from a relevant document. The p erformance b oost continues for the larger distances. At distance 4, the web-only distribution has an average 1.73 relevant documents and the web plus similarity links has 3.98. The addition of 10 random links results in no significant gains at distance 5 or less. The additional links do bring Acknowledgments This work was supp orted in part by the Center for Intelligent Information Retrieval and in part by the Defense Advanced Research Pro jects Agency (DARPA) under contract numb er HR0011-06-C-0023. Any opinions, findings and conclusions or recommendations expressed in this material are the authors' and do not necessarily reflect those of the sp onsor. 6. REFERENCES [1] J. Allan. Building hypertext using information retrieval. IPM, 33(2):145­159, 1997. [2] S. Chakrabarti, M. M. Joshi, K. Punera, and D. M. Pennock. The structure of broad topics on the web. In WWW '02. ACM Press, 2002. [3] F. Menczer. Lexical and semantic clustering by web links. JASIST, 55(14):1261­1269, 2004. [4] M. D. Smucker and J. Allan. Find-similar: Similarity browsing as a search tool. In SIGIR '06, pages 461­468. ACM Press, 2006. [5] I. Soboroff. Do TREC web collections look like the web? SIGIR Forum, 36(2):23­31, 2002. [6] S. Vassilvitskii and E. Brill. Using web-graph distance for relevance feedback in web search. In SIGIR '06, pages 147­153. ACM Press, 2006. 864