SIGIR 2007 Proceedings

Demonstration

Focused Ranking in a Vertical Search Engine
Philip O'Brien
GenieKnows.com, 1567 Argyle St., Halifax, Nova Scotia, Canada
pobrien@genieknows.com

Tony Abou-Assaleh
GenieKnows.com, 1567 Argyle St., Halifax, Nova Scotia, Canada
taa@genieknows.com

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

General Terms
Algorithms, Design

Keywords
topic, focused, ranking, crawling, search

1. OVERVIEW

Since the debut of PageRank and HITS, hyperlink-induced Web document ranking has come a long way. The Web has become increasingly vast and topically diverse. Such vastness has led many into the area of topic-sensitive ranking [1] and its variants. We address the high dimensionality of the Web by providing tools for focused search. A focused search engine is one that seeks coverage over a subset of the topics on the Web and presents users with relevant search results in a known domain. This demonstration introduces the GenieKnows.com Vertical Search Engine. We present the major components of our vertical search engine: (1) a patent-pending focused Web crawler and (2) a novel hyperlink-induced focused ranking algorithm.

Focused Crawling: The crawler fetches a Web page only if its naïve Bayesian classifier determines that the page may belong to the category of interest with a high probability. The result is a collection of documents that have a high probability of being relevant within a particular vertical, obtained with reduced bandwidth and storage requirements.

Focused Ranking: Augmenting focused crawling, documents in a category $C_i$ are classified into subtopics, $\tau_i$. We obtain these subtopics from the second- and third-level categories of the DMOZ sub-directory for $C_i$. A topic-inclusion probability is computed for each document-topic pair. The probability of a document $d$ being on topic $t \in \tau_i$ is denoted $P(t \mid d)$. A FocusedRank index is generated by first computing an adjacency matrix, $M$, from the link graph of $C_i$, where a link is interpreted to exist only if a page $u$ hyperlinks to a page $v$ and the two contain at least one common topic. The topic-inclusion score, $T(u, v)$, stored in $M$ is the normalized sum of products of the probabilities of shared topics between $u$ and $v$. That is, after normalizing the probabilities for $u$,

$$M_{uv} = T(u, v) = \sum_{t \in \tau_i} P(t \mid u) \cdot P(t \mid v).$$

Traditional link-based, iterative ranking is performed on $M$ until convergence is achieved. This architecture is depicted in Figure 1.

Figure 1: System architecture

The search engine is fronted by a simple, familiar interface that displays search results, category subtopics, and the number of search results belonging to each subtopic. Following a user study, we compared the mean average score and mean average precision of FocusedRank to those of PageRank [2] and Topic-Sensitive PageRank [1] (TSPR). Two-tailed t-tests show a significant accuracy increase over PageRank for both measures (p < 0.02) and accuracy equivalent to TSPR with significantly reduced storage and computation requirements.

Copyright is held by the author/owner(s). SIGIR'07, July 23-27, 2007, Amsterdam, The Netherlands. ACM 978-1-59593-597-7/07/0007.
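The FocusedRank construction described in the overview can be illustrated with a minimal Python sketch. It is not the GenieKnows implementation: the function names, the damping factor of 0.85, the uniform handling of pages without outgoing links, and the row normalization of the matrix during iteration are all assumptions borrowed from standard PageRank practice. "Normalizing the probabilities for u" is read here as scaling the source page's topic probabilities to sum to 1, which is only one possible interpretation.

```python
from typing import Dict, Iterable, List, Tuple

Probs = Dict[str, float]        # topic -> P(t | page)
Edge = Tuple[str, str]          # (u, v): page u hyperlinks to page v


def normalize(probs: Probs) -> Probs:
    """Scale topic probabilities to sum to 1 (assumed normalization)."""
    total = sum(probs.values())
    return {t: p / total for t, p in probs.items()} if total > 0 else {}


def topic_inclusion(pu: Probs, pv: Probs) -> float:
    """T(u, v): sum over shared topics of P(t|u) * P(t|v)."""
    return sum(p * pv[t] for t, p in pu.items() if t in pv)


def build_matrix(links: Iterable[Edge],
                 topic_probs: Dict[str, Probs]) -> Dict[Edge, float]:
    """Sparse M: keep u -> v only when the pages share a topic."""
    M: Dict[Edge, float] = {}
    for u, v in links:
        pu = normalize(topic_probs.get(u, {}))   # normalize the source page u
        w = topic_inclusion(pu, topic_probs.get(v, {}))
        if w > 0:
            M[(u, v)] = w
    return M


def focused_rank(pages: List[str], M: Dict[Edge, float],
                 damping: float = 0.85, tol: float = 1e-10,
                 max_iter: int = 200) -> Dict[str, float]:
    """PageRank-style power iteration over the weighted matrix M."""
    n = len(pages)
    out_weight = {p: 0.0 for p in pages}
    for (u, _), w in M.items():
        out_weight[u] += w
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        # Rank held by pages with no surviving out-links is spread uniformly.
        dangling = sum(rank[p] for p in pages if out_weight[p] == 0.0)
        new = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for (u, v), w in M.items():
            new[v] += damping * rank[u] * w / out_weight[u]
        delta = sum(abs(new[p] - rank[p]) for p in pages)
        rank = new
        if delta < tol:
            break
    return rank


# Hypothetical toy data: three pages, two candidate links.
example_pages = ["a", "b", "c"]
example_links = [("a", "b"), ("b", "c")]
example_probs = {"a": {"x": 0.6, "y": 0.2}, "b": {"x": 0.5}, "c": {"z": 0.9}}
example_M = build_matrix(example_links, example_probs)
example_rank = focused_rank(example_pages, example_M)
```

In the toy data, the link from `b` to `c` is dropped because the two pages share no topic, so only `a -> b` contributes weight. Storing `M` sparsely as a dictionary of edges is a convenience for the sketch, not a statement about how the index is stored in the actual system.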

2. REFERENCES
[1] T. H. Haveliwala. Topic-sensitive PageRank. In 11th International World Wide Web Conference, pages 517-526, New York, NY, USA, May 2002.
[2] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford University, 1999.
