SIGIR 2007 Proceedings Poster

Automatic Classification of Web Pages into Bookmark Categories

Chris Staff and Ian Bugeja
University of Malta, Department of Computer Science and AI, Malta
cstaff@cs.um.edu.mt, ian@iannet.net

ABSTRACT
We describe a technique to automatically classify a web page into an existing bookmark category to help a user to bookmark a page. HyperBK compares a bag-of-words representation of the page to descriptions of categories in the user's bookmark file. Unlike default web browser dialog boxes, in which the user may be presented with the category into which he or she saved the last bookmarked file, HyperBK also offers the category most similar to the page being bookmarked. The user can also opt to create a new category, or to save the page elsewhere. In an evaluation, the user's preferred category was offered on average 61% of the time.

Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Clustering

General Terms
Algorithms

Keywords
Automatic Classification, Bookmarks, Web Browsers

1. INTRODUCTION
Bookmark management systems that can help classify bookmarked web pages, track web pages that have moved since they were bookmarked, help a user to find web pages similar to pages that were bookmarked, and generally assist with their own organisation are becoming increasingly important. Recent surveys indicate that a user's bookmark file contains on average 184 entries [3], and that approximately 73.7% of pages visited are page revisits [5], with interaction through a bookmark file, the history list of recently visited sites, or the browser's back button being the most common ways of revisiting a page. Web browsing software, such as Mozilla Firefox and Microsoft Internet Explorer, provides only limited support for automatic management of bookmarked web pages [6], and even less support is provided for navigating through the list of recently visited web pages to enable a user to return to a recently visited page [5].

HyperBK [2] addresses some of these issues. HyperBK is implemented as a Firefox extension and utilises a third-party search engine to track bookmarked web pages that have changed location. A user can be reminded of the query that had been used to find a web page before it was bookmarked, or HyperBK can suggest a query to use to find web pages similar to a category of bookmarked web pages. HyperBK provides a variety of views to potentially make it easier for a user to find an entry in the list of recently visited web pages. Finally, HyperBK automatically recommends a bookmark category into which to store a web page to be bookmarked. This last feature is the subject of this paper.

Similar systems are reported in Sect. 2. The web page classification algorithm is described in Sect. 3, and results of the evaluation are presented in Sect. 4. We give our future work and conclusion in Sect. 5.

2. SIMILAR SYSTEMS
Bookmark management systems are usually offered as standalone systems: unlike HyperBK, none is integrated into a browser [2] and, as discussed in our earlier work [2], web browsers offer minimal bookmark management facilities. Most systems do not offer automatic web page classification features, although Abrams, Baecker, and Chignell [1] list some requirements for bookmark management systems. Among their requirements are improving the organisation of bookmarks on behalf of the user, possibly by automatically "filing" new bookmark entries, and integrating the bookmark management system with a web browser. Feng and Börner [4] use "semantic treemaps" to categorise bookmark entries. Li and Yamanishi [7] use a "finite mixture model" to classify documents, but this requires the prior existence and standard description of categories in which to place documents. On the other hand, Shen et al. [9] use a page summary on which to base a classification.

Copyright is held by the author/owner(s).
SIGIR'07, July 23-27, 2007, Amsterdam, The Netherlands.
ACM 978-1-59593-597-7/07/0007.

3. CLASSIFYING WEB PAGES TO BE BOOKMARKED
We parse each accessed web page using its Document Object Model (DOM) to extract the text components. We remove stop words and HTML and JavaScript tags, and the remaining words are stemmed using the Porter Stemmer [8]. The five stems with the highest term frequency are selected to represent the page, but only if they have a frequency of at least two; otherwise only three stems are selected. This helps to keep down computational costs. If a web document contains META keywords, then the five META keywords that occur most frequently in the document are used instead.

The algorithm used to select the candidate category is based primarily on simple keyword matches. As described above, each time a web page is accessed, representative keywords are extracted and stored. If the page is bookmarked into a category, these terms are added to the set of terms that represent the category. The recommended category is the category that has the greatest number of keyword matches for the incoming web page. If this fails, then the recommended category will be the category that contains another page from the same domain, as long as the category and page share at least one keyword. Finally, if even this fails, then the page title is compared to the category descriptions. The category with the highest number of term matches is recommended.

4. EVALUATION
We collected real users' bookmark files to see if HyperBK would assign bookmarked pages to categories in the same way that the users did. We require bookmark files to be organised (and to contain some categories), and we assume that the user has assigned each bookmarked page to the correct category. This is a weak assumption but we had insufficient time to conduct a longitudinal study.

Students following the BSc IT (Hons) degree programme at the University of Malta were invited by e-mail to submit their bookmark files. Of approximately 200 students contacted, 30 submitted their bookmark files (a response rate of about 15%). Of these, 22 files were considered inappropriate for use because they did not contain more than one or two categories, and we felt that including them in the evaluation could unfairly bias the results in HyperBK's favour.

We randomly removed 10 URLs from categories of 5 of the remaining 8 bookmark files. We removed fewer than 10 URLs from the other three: in two cases because there were too few categories overall, and in the third case (79231 in Table 1) because, although there were many categories, most of them contained very few bookmarks. The challenge was to place the randomly chosen bookmarks into the same categories selected by the users.

The results are given in Table 1. We would hope for a generally high precision, perhaps dropping slightly as the number of categories grows, especially if categories become less distinguishable from each other (because only 5 terms are selected to describe a page). For two bookmark files containing 38 and 45 categories, precision drops to below 0.7, which is probably unacceptably low. However, one of the two bookmark files contains many similar categories, and the other had many categories most of which contained unrelated bookmarks.

Table 1: Classification Evaluation Results (Legend: 'BKs' = Total Bookmarks; 'Cat.' = Total Categories; 'Hits' = bookmarks allocated into correct category; 'Misses' = bookmarks allocated wrongly; 'Near Hits' = bookmarks allocated to a parent category (excluding the Bookmarks Root); 'Approx Precision' = (Near Hits + Hits)/total; 'Precision' = Hits/total.)

5. CONCLUSION AND FUTURE WORK
Automatic bookmark file classification could be a useful extension to web browsers. Instead of just offering the last category used to store a bookmark, or dumping the newly created bookmark into a default location, HyperBK recommends a category based on a simple matching algorithm, which has been extended to consider the domain names of previously bookmarked pages and keyword extraction from titles. In an experiment, 61% of bookmarks were classified correctly.

Next, we intend to modify the approach to keyword selection and category representation. First, we will segment a document into its component topics, and extract keywords from the topic most likely to be relevant to the user. A web page is likely to contain information about more than one topic, but it is unlikely that the user has bookmarked the page because of an interest in all of its topics. Representing a category and a document to be bookmarked using only keywords that occur in topics of interest to the user may help to improve precision.

6. REFERENCES
[1] D. Abrams, R. Baecker, and M. Chignell. Information archiving with bookmarks: personal Web space construction and organization. In CHI '98: Proceedings of the SIGCHI conference on Human factors in computing systems, pages 41-48, New York, NY, USA, 1998. ACM Press/Addison-Wesley Publishing Co.
[2] I. Bugeja. Managing WWW browser's bookmarks and history (a Firefox extension). Final year project report, Department of Computer Science & AI, University of Malta, 2006.
[3] A. Cockburn and B. McKenzie. What do web users do? An empirical analysis of web use. Int. J. Hum.-Comput. Stud., 54(6):903-922, 2001.
[4] Y. Feng and K. Börner. Using semantic treemaps to categorize and visualize bookmark files. In Proceedings of SPIE - Visualization and Data Analysis, volume 4665, pages 218-227, January 2002.
[5] E. Herder. Forward, Back, and Home Again: Analysing User Behavior on the Web. PhD thesis, University of Twente, 2005.
[6] W. Jones, H. Bruce, and S. Dumais. Keeping found things found on the web. In CIKM '01: Proceedings of the tenth international conference on Information and knowledge management, pages 119-126, New York, NY, USA, 2001. ACM Press.
[7] H. Li and K. Yamanishi. Document classification using a finite mixture model. In Proceedings of the 35th annual meeting on Association for Computational Linguistics, pages 39-47, Morristown, NJ, USA, 1997. Association for Computational Linguistics.
[8] M. F. Porter. An algorithm for suffix stripping. Readings in information retrieval, pages 313-316, 1997.
[9] D. Shen, Z. Chen, Q. Yang, H.-J. Zeng, B. Zhang, Y. Lu, and W.-Y. Ma. Web-page classification through summarization. In SIGIR '04: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, pages 242-249, New York, NY, USA, 2004. ACM Press.
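
As an illustration only, the keyword selection and three-stage category recommendation described in Sect. 3 might be sketched in Python as follows. This is not the HyperBK implementation (which is a Firefox extension); all names (select_keywords, Category, unique_best, recommend) are hypothetical. The sketch assumes that stop-word removal and Porter stemming have already been applied, that "at least two" must hold for all five top stems, that a stage "fails" when it yields no unique best category (no match, or a tie), and that a category's description is its accumulated set of terms.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass, field
from urllib.parse import urlparse


def select_keywords(stems: list[str]) -> list[str]:
    """Pick representative stems from an already stop-worded, stemmed page."""
    top = Counter(stems).most_common(5)
    # Assumed reading: keep five stems only if all five occur at least twice;
    # otherwise fall back to the three most frequent stems.
    if len(top) == 5 and all(count >= 2 for _, count in top):
        return [stem for stem, _ in top]
    return [stem for stem, _ in top[:3]]


@dataclass
class Category:
    name: str
    terms: set[str] = field(default_factory=set)    # keywords of pages filed here
    domains: set[str] = field(default_factory=set)  # hosts of pages filed here

    def add_page(self, url: str, keywords: list[str]) -> None:
        """Bookmarking a page adds its keywords and host to the category."""
        self.terms.update(keywords)
        self.domains.add(urlparse(url).netloc)


def unique_best(scores: dict[str, int]) -> str | None:
    """Return the single top-scoring name, or None on a zero score or a tie."""
    best = max(scores.values(), default=0)
    winners = [name for name, score in scores.items() if score == best]
    return winners[0] if best > 0 and len(winners) == 1 else None


def recommend(url: str, keywords: list[str], title_terms: list[str],
              categories: list[Category]) -> str | None:
    """Recommend a category name for a page, or None if nothing matches."""
    page_terms = set(keywords)
    overlap = {c.name: len(page_terms & c.terms) for c in categories}

    # Stage 1: category with the greatest number of keyword matches.
    choice = unique_best(overlap)
    if choice is not None:
        return choice

    # Stage 2: category holding a page from the same domain, provided the
    # category and the page share at least one keyword.
    host = urlparse(url).netloc
    same_domain = {c.name: overlap[c.name] for c in categories
                   if host in c.domains and overlap[c.name] >= 1}
    choice = unique_best(same_domain)
    if choice is not None:
        return choice

    # Stage 3: compare page-title terms with the category terms; ties are
    # broken by category order (an assumption of this sketch).
    title_overlap = {c.name: len(set(title_terms) & c.terms) for c in categories}
    best = max(title_overlap, key=title_overlap.get, default=None)
    return best if best is not None and title_overlap[best] > 0 else None
```

In this sketch, a page whose keywords overlap one category more than any other is assigned at stage 1; the domain and title stages act only when keyword matching gives no unique answer. Whether HyperBK treats ties in this way is not stated in the text above.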