WWW 2007 / Poster Paper

Topic: XML

Adaptive Record Extraction From Web Pages

Justin Park
University of Calgary, 2500 University Dr NW, Calgary, AB, Canada
parkj@cpsc.ucalgary.ca

Denilson Barbosa
University of Calgary, 2500 University Dr NW, Calgary, AB, Canada
denilson@ucalgary.ca

ABSTRACT
We describe an adaptive method for extracting records from web pages. Our algorithm combines a weighted tree matching metric with clustering to obtain data extraction patterns. We compare our method experimentally to the state of the art, and show that our approach is very competitive for rigidly structured records (such as product descriptions) and far superior for loosely structured records (such as entries on blogs).
[Figure 1: Overview of our method. Pipeline: HTML document -> step 1: preprocess -> step 2: find similar nodes -> step 3: propose pattern compositions -> step 4: refine pattern compositions -> step 5: extract data -> step 6: choose the best pattern -> pattern and data.]
Categories and Subject Descriptors
H.2.4 [Database Management]: Textual Databases; H.3.3 [Information Search and Retrieval]: Clustering

General Terms
Algorithms, Experimentation.

1. INTRODUCTION
A substantial fraction of the web consists of the so-called deep web: pages that are dynamically generated using predefined templates populated with data from databases. Because these databases are maintained by organizations with vested interests in their accuracy and usefulness, the information in deep web pages tends to be of very high quality. However, deep web sites are intended for human consumption, much like other web sites, and do not provide access to their data to computer applications. Recently, there has been considerable interest in building automatic tools capable of extracting data from disparate web sites and representing it in a form amenable to processing by other applications [2, 3, 4, 6, 8]. In particular, there has been considerable work based on using the tree-edit distance [7] metric as a basis for finding data extraction patterns. Most of the previous tools are tailored to specific web sites. The MDR [4] and DEPTA [6] approaches aim at extracting product listings or data displayed in tabular form; Zhao et al. [8] focus on extracting entries from search engine result pages; the news extractor by Reis et al. [5] only extracts news articles. A major shortcoming of these works is that they use tree-edit distance strictly. Doing so works well on strictly structured records, such as product descriptions, but not so well on loosely structured records, such as blog entries. In this paper, we describe a general-purpose web data extractor, which we call WDE for simplicity, that performs well on both rigidly and loosely structured records in an HTML document. We validate our tool on several kinds of web pages, including product listings, search engine result pages, sports scoreboards, forums, and blogs.

This work was funded in part by grants from the Natural Sciences and Engineering Research Council of Canada and the Alberta Ingenuity Fund.

2. OUR METHOD
Figure 1 gives an overview of our data extractor. Given the URL of a web page, WDE automatically discovers all repeating patterns found in the page, as follows. Initially, we use a standard tool for tidying the HTML page (http://people.apache.org/~andyc/neko/doc/html); our preprocessing step also merges all non-structural HTML tags (e.g., those specifying fonts, colors, etc.). The result of this step is a tree representation of the web page. The next step is identifying clusters of tree nodes with similar structure. As in previous work, we compute the similarity between two nodes in the tree by a tree matching algorithm that approximates the true tree-edit distance [7] value between those nodes. (This is due to the high cost of finding the actual tree-edit distances between all pairs of

Copyright is held by the author/owner(s). WWW 2007, May 8-12, 2007, Banff, Alberta, Canada. ACM 978-1-59593-654-7/07/0005.



Table 1: Accuracy on product listings. Each cell shows correct / missed / false positives.

Site                         WDE              MDR
Findgift.com                 10 /  0 /  0     10 /  0 /  0
Ebay.com                     50 /  0 /  0     50 /  0 /  0
Pricerunner.com              25 /  0 /  0     25 /  0 /  0
Backcountry.com              31 /  1 /  0     32 /  0 /  0
Download.com                  9 /  1 /  0      9 /  1 /  0
Shopping.yahoo.com           15 /  0 /  0     11 /  0 /  4
Radioshack.com                0 / 17 /  0     17 /  3 /  0
Nextag.com                   15 /  0 /  0     15 /  0 /  0
Indio.ca                     10 /  0 /  0      0 / 10 /  0
Dealtime.com                 14 /  1 /  0     14 /  1 /  0
del.icio.us                  10 /  0 /  0      0 /  0 /  0
Barnsandnoble.com            10 /  0 /  0      0 /  0 / 10
Youtube.com                  20 /  0 /  0     20 /  0 /  0
Imdb.com                      8 /  4 /  0      0 /  0 / 12
allrecipes.com               20 /  0 /  0     20 /  0 /  0
foodtv.ca                    10 /  0 /  0     10 /  0 /  0
weblog.xanga.com             10 /  0 /  0      0 / 10 /  0
nhl.com                      14 /  0 /  0      0 / 13 /  1
calgarypubliclibrary.com     20 /  0 /  0      0 / 20 /  0
mls.ca                        0 / 10 /  0      0 /  0 / 10
Average recall               89.9%            80.1%
Average precision            100%             86.3%

Table 2: Accuracy on discussion forums. Each cell shows correct / missed / false positives.

Site                         WDE              MDR
forums.gentoo.org            25 /  0 /  0     25 /  0 /  0
forum.java.sun.com            9 /  2 /  0      0 / 11 /  0
Youtube.com                  10 /  0 /  0      0 / 10 /  0
operawatch.com                5 /  0 /  0      0 /  5 /  0
shoutwire.com                 7 /  3 /  0      0 / 10 /  0
engadget.com                  7 /  0 /  0      0 /  7 /  0
gizmodo.com                   6 /  0 /  0      6 /  0 / 19
thinkprogress.org            42 / 19 /  0      0 / 61 /  0
discussion.forum.nokia.com   15 /  0 /  0     15 /  0 /  0
messages.yahoo.com           18 /  0 /  0     18 /  0 /  1
Average recall               85.7%            38.1%
Average precision            100%             76.2%

Tables 1 and 2 show the results of our experiments. The tables show three values for each web site and each method: the number of correct records retrieved, the number of missed records, and the number of false positives, in this order. The tables also show the average recall and precision for each method on all web sites in each group. On product listings (Table 1), WDE showed very high precision, although it sometimes missed a few records. The reason is that, in the refining step, outlier records that differ from the more common ones are eliminated. On the other hand, this refining step also eliminated all false positives, resulting in 100% precision. While WDE is very competitive with MDR on strictly structured records, it showed drastically superior performance in our second test, with loosely structured records (Table 2).
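The per-method averages reported with the tables can be recomputed from each site's (correct, missed, false positive) counts. The sketch below is a hypothetical helper, not the authors' code, and assumes the reported averages are unweighted (macro) averages over sites, which the paper does not spell out:

```python
# Each site's result is a triple (correct, missed, false_pos), matching
# the three values reported per site and per method in Tables 1 and 2.

def recall(correct, missed, false_pos):
    # Fraction of the intended records that were retrieved.
    total = correct + missed
    return correct / total if total else 1.0

def precision(correct, missed, false_pos):
    # Fraction of the returned records that are correct.
    returned = correct + false_pos
    return correct / returned if returned else 1.0

def averages(counts):
    # Macro-average over sites, as in the tables' "Average" rows
    # (an assumption about how the averages were computed).
    recs = [recall(*c) for c in counts]
    precs = [precision(*c) for c in counts]
    return sum(recs) / len(recs), sum(precs) / len(precs)
```

For example, counts of 42 correct, 19 missed, and 0 false positives give a recall of 42/61, roughly 68.9%, and a precision of 100%.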

nodes in the tree.) Unlike in previous work [6], we assign different weights to nodes depending on their height (internal nodes get higher weights). In the third step, we enumerate all possible candidate records in the page. Candidate records can have one or more nodes, but cannot have more than one node from the same cluster. Also, we assume that records do not overlap. The fourth step computes the similarity of all pairs of records identified in the previous step. This is done as follows. Each record is converted into a tree whose root is a dummy node containing all nodes in that record as children. The similarity between two records is then computed as in step 2. Finally, we cluster the records based on their similarity. From the clusters obtained after step 4, we generate data extraction patterns that navigate the HTML tree and extract the actual data that form the records. At the time of writing, we report to the user all patterns extracted from clusters with high pairwise similarity among records. Our future work consists of finding automatic ways of evaluating these patterns (step 6).
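The node-similarity computation of step 2 can be sketched as follows. This is a minimal illustration in the spirit of simple tree matching with a height-based weighting, not the paper's actual algorithm; the weight function and the normalization are illustrative assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    tag: str
    children: list = field(default_factory=list)

def height(n):
    return 1 + max((height(c) for c in n.children), default=0)

def weight(n):
    # Illustrative choice: taller (internal) nodes get higher weights.
    return height(n)

def tree_match(a, b):
    # Simple-tree-matching-style score: 0 if the roots' tags differ;
    # otherwise the root's weight plus the best alignment of the two
    # children sequences, found by dynamic programming.
    if a.tag != b.tag:
        return 0.0
    m, n = len(a.children), len(b.children)
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1],
                           dp[i - 1][j - 1]
                           + tree_match(a.children[i - 1], b.children[j - 1]))
    return weight(a) + dp[m][n]

def similarity(a, b):
    # Normalize by the average self-match so identical trees score 1.0.
    denom = (tree_match(a, a) + tree_match(b, b)) / 2.0
    return tree_match(a, b) / denom if denom else 0.0
```

Records (dummy roots over groups of nodes) could then be clustered by thresholding this pairwise similarity, along the lines of steps 3-5.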

3. EXPERIMENTAL EVALUATION
We compared our tool to the MDR [4] method, which is the state of the art in web data extraction based on tree-edit distance. We used 30 web sites in our comparison; 20 of these sites contain product listings, usually organized in tabular format; the other 10 contain user comments, similar to discussion groups or blogs. We use the standard notions of precision and recall [1] to evaluate our tool. Recall is defined as the percentage of the intended data records that are retrieved by the tool; precision is defined as the percentage of the returned data records that are correct. We determine correctness (i.e., precision) manually, on a best-effort basis. On average, our tool returned fewer than 5 patterns per site. For the comparison below, we chose the pattern that best identified the records after a quick visual inspection. (Almost always, the chosen pattern was the one producing the largest number of records.)

4. CONCLUSION
We proposed a novel method for extracting records from web pages based on an adaptive, weighted tree matching algorithm and the clustering of similar records. Unlike previous methods, our method performs well for both strictly and loosely structured records, as confirmed by our comprehensive experimental evaluation. Our future work consists of studying criteria for ranking patterns and experimenting with other kinds of records.

5. REFERENCES
[1] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley, 1999.
[2] V. Crescenzi, G. Mecca, and P. Merialdo. RoadRunner: Towards Automatic Data Extraction from Large Web Sites. In VLDB 2001, pp. 109-118.
[3] A. Laender, A. da Silva, B. Ribeiro-Neto, and J. Teixeira. A Brief Survey of Web Data Extraction Tools. SIGMOD Record 31(2), pp. 84-93, 2002.
[4] B. Liu, R. Grossman, and Y. Zhai. Mining Data Records in Web Pages. In KDD 2003, pp. 601-606.
[5] D. Reis, P. Golgher, A. Silva, and A. Laender. Automatic Web News Extraction Using Tree Edit Distance. In WWW 2004, pp. 502-511.
[6] Y. Zhai and B. Liu. Web Data Extraction Based on Partial Tree Alignment. In WWW 2005, pp. 76-85.
[7] K. Zhang and D. Shasha. Tree Pattern Matching. In Pattern Matching Algorithms, Oxford University Press, 1997.
[8] H. Zhao, W. Meng, Z. Wu, V. Raghavan, and C. Yu. Fully Automatic Wrapper Generation for Search Engines. In WWW 2005, pp. 66-75.
