Geographic Ranking for a Local Search Engine

Tony Abou-Assaleh
GenieKnows.com
1567 Argyle St., Halifax, Nova Scotia, Canada
taa@genieknows.com

Weizheng Gao
GenieKnows.com
1567 Argyle St., Halifax, Nova Scotia, Canada
wgao@genieknows.com

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval

General Terms
Algorithms, Design

Keywords
geographic, local, focused, ranking, crawling, search

1. OVERVIEW
Traditional schemes for ranking the relevance of a Web page to a user query are less appropriate when the query contains geographic information. Geographic entities, such as addresses, city names, and location names, often appear only once or twice in a Web page, and are typically not in a heading or a larger font. Consequently, an alternative to the traditional weighted tf*idf relevance ranking is needed. Further, if a Web site contains a geographic entity, its in- and out-neighbours often do not refer to the same entity, although they may refer to other geographic entities. We present a local search engine that applies a novel ranking algorithm suitable for ranking Web pages with geographic content. We describe its major components: geographic ranking, focused crawling, the geographic extractor, and the related Web sites feature.

Geographic Ranking: Geographic ranking is an off-line, link-based ranking that is combined with a query-dependent score to determine the final ranking of the search results. A link graph is created with two types of nodes: pages and geographic entities. An edge exists between a page node and a geographic node if the Web page contains the geographic entity within its text. An iterative algorithm alternates between computing scores for the page nodes and the geographic nodes. At each iteration, the scores of the geographic nodes are computed from the scores of the page nodes linking to them, effectively giving a higher score to more popular geographic entities. The page node scores are then computed from the scores of the geographic entities they link to, effectively giving a higher score to pages that contain popular geographic entities. The process is repeated until convergence is achieved or the maximum number of iterations is reached; a sketch of this alternating computation is given below.
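The following Python sketch illustrates one way the alternating score propagation described above could be implemented. The graph representation (a mapping from page identifiers to the geographic entities they contain), the initialization, the normalization, and the stopping tolerance are illustrative assumptions; the description above does not prescribe these details.

  # Minimal sketch of the alternating page/entity scoring, under the
  # assumptions stated above (representation, initialization, and
  # normalization are not specified in the paper).
  import math

  def geographic_rank(page_to_entities, max_iters=50, tol=1e-6):
      """page_to_entities: dict mapping page id -> set of geographic entity ids."""
      pages = list(page_to_entities)
      entities = {e for ents in page_to_entities.values() for e in ents}

      page_score = {p: 1.0 for p in pages}
      entity_score = {e: 1.0 for e in entities}

      for _ in range(max_iters):
          # Entity scores: sum the scores of the pages that mention each entity,
          # so frequently mentioned (popular) entities score higher.
          new_entity = {e: 0.0 for e in entities}
          for p, ents in page_to_entities.items():
              for e in ents:
                  new_entity[e] += page_score[p]

          # Page scores: sum the scores of the entities each page mentions,
          # so pages containing popular entities score higher.
          new_page = {p: sum(new_entity[e] for e in page_to_entities[p])
                      for p in pages}

          # Normalize both score vectors to unit length to keep values bounded.
          for scores in (new_entity, new_page):
              norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
              for k in scores:
                  scores[k] /= norm

          # Stop once the page scores change by less than the tolerance.
          delta = sum(abs(new_page[p] - page_score[p]) for p in pages)
          page_score, entity_score = new_page, new_entity
          if delta < tol:
              break

      return page_score, entity_score

At query time, the resulting off-line page scores would be combined with the query-dependent relevance score to produce the final ranking of the search results.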
Focused Crawling: We developed a dynamic, distributed, focused geographic crawler [1] that currently downloads 20 million pages a day and is easily expanded by adding more crawling nodes (servers). A spam detection module based on the work of Ntoulas et al. [2] removes many spam pages as they are encountered. The crawler is dynamic in the sense that while it is running as a distributed multi-node system, new URLs can be added to the queue, the black list can be modified on the fly, and crawling nodes can be added or removed. We developed a set of tools that allows us to monitor and control the crawler in real time. When URLs are added to the crawler queue, they are ranked by the probability that the target Web page contains a geographic entity: URLs containing a location name are ranked highest, followed by URLs ranked by a naïve Bayes classifier.

Geographic Extractor: In addition to extracting typical features of a Web site, such as hyperlinks, titles, headings, and keywords, we remove duplicates and extract geographic entities. A geographic entity may include a street number, street name, city name, state or province name, country, zip or postal code, and telephone number. Currently, we are able to parse Canadian and US geographic entities.

Related Web Sites: When a user searches our business directory, we provide a list of related Web sites. These Web sites are retrieved from our indices based on the business name and location, and are ranked using the geographic ranking described above.

2. REFERENCES
[1] W. Gao, H. C. Lee, and Y. Miao. Geographically focused collaborative crawling. In WWW '06: Proceedings of the 15th International Conference on World Wide Web, pages 287–296, New York, NY, USA, 2006. ACM Press.
[2] A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly. Detecting spam web pages through content analysis. In WWW '06: Proceedings of the 15th International Conference on World Wide Web, pages 83–92, New York, NY, USA, 2006. ACM Press.