are chosen to divide the web page for subsequent conversion and summarization. A Function-based Object Model (FOM) is proposed in [3]. FOM attempts to understand the author's intention by identifying object functions. The VIPS (VIsion-based Page Segmentation) algorithm [2] extracts the semantic structure of a web page. In VIPS, a tree structure is used to model the page. Each node corresponds to a block in the page and carries a value that indicates its Degree of Coherence. The DOM tree is analyzed from root to leaves, and the DOM nodes are divided based on their spatial layout and visual cues. [5] describes an HTML web page segmentation algorithm for dividing online medical journal articles. In [5], the web page content is modelled by a zone tree structure based on the geometric layout of the web page.
ABSTRACT
This paper presents experiments with a web page topic segmentation algorithm that show a significant precision improvement in the retrieval of documents from the Web track corpus of TREC 2001. Instead of processing the whole document, a web page is segmented into semantic blocks according to visual criteria (such as horizontal lines and colors) and structural tags (such as headings <H1>~<H6> and paragraph <P>). We conclude that combining visual and content layout criteria gives the best precision gains: the ranking of a page is computed from the relevant segments produced by the segmentation algorithm.
Categories and Subject Descriptors
H.3.1 [Content Analysis and Indexing]: Abstracting methods, Indexing methods; H.3.3 [Information Storage and Retrieval]
General Terms
Algorithms, Measurement, Performance, Experimentation.
Keywords
Segmentation, topic analysis, evaluation, block's coherence.
3. TOPIC SEGMENTATION
A single web page often carries several meanings, and its parts differ in importance. We assume that there are two types of Web pages: single-topic pages and multi-topic pages. The content of a single-topic page is homogeneous, while a multi-topic page is divided into several blocks of homogeneous content. The textual content of a page follows the sequential organization of its topics. Topic analysis is based on boundary delimitation. In flat text, we distinguish two types of basic units: the sentence, which is made up of a fixed number of words, and the paragraph. On the World Wide Web, however, a document is composed of textual content and HTML structure. Authors use both visual criteria, such as horizontal lines, vertical lines and colors, and content layout tags, such as headings <H1>~<H6>, paragraph <P> and table <TABLE>, to separate segments that cover different topics. The separation mode differs from one author to another, and visual criteria and content layout tags are not used consistently as segment delimiters, which is a major obstacle to segmenting Web pages. Consequently, segment delimiters vary arbitrarily from page to page and follow no specific rules. We propose a solution for Web page segmentation based on the evaluation of several candidate segmentations with a topic analysis method. Furthermore, the topic segmentation algorithm based on visual
1. INTRODUCTION
Most information retrieval systems on the Web treat web pages as the smallest, indivisible units of information, whereas a web page as a whole may not be appropriate to represent a single topic. A web page usually contains various contents that are not all related to the same topic, and its topics are not necessarily semantically linked to each other. Therefore, detecting the semantic content structure of a web page could improve the performance of web information retrieval, and many web applications can exploit this structure. Previous work uses ad hoc methods to deal with different types of web pages. If the semantic content structure of a web page is available, wrappers can be built and information extracted more easily.
2. RELATED WORKS
A straightforward approach for segmenting web pages is to use tag information. Usually, a small set of tags serves as segment delimiters. In [4], four types of tags
Copyright is held by the author/owner(s). SIGIR'07, July 23-27, 2007, Amsterdam, The Netherlands. ACM 978-1-59593-597-7/07/0007.
SIGIR 2007 Proceedings
Poster
criteria and content layout is described as follows: different segments are extracted from each web page of the TREC collection using the segment delimiters that appear in the page and belong to a predefined list of criteria. One segmentation solution is generated per criterion, so the result is a set of candidate solutions. The evaluation function is then applied to each solution. The best solution is selected and the block index is created. Our goal is to find a segmentation based on visual criteria (lines, colors) and content presentation (paragraphs, subtitles) that yields blocks whose content is internally coherent and that are far apart from one another. Compared with the segmentation algorithms reviewed above, our contribution consists in dividing web pages into topic segment units. Specifically, our topic segmentation algorithm partitions Web pages into coherent segment units that correspond to a sequence of subtopical passages. The algorithm assumes that a given set of words is used during the discussion of a subtopic and that, when the subtopic changes, a significant proportion of the vocabulary changes as well. With our evaluation function, we keep only the candidate delimiters that are useful for segmenting a Web page and eliminate noisy HTML tags.
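The procedure described above (one candidate segmentation per criterion, then selection of the best-scoring one) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the `CRITERIA` list, the regular expressions and all function names are hypothetical, only criteria detectable from tags are covered (colors and vertical lines are not), and `evaluate` stands in for the segmentation evaluation function of Section 4.

```python
import re
from typing import Callable, Dict, List, Tuple

# Hypothetical criterion list: each name maps to a regex that matches the HTML
# delimiter for that criterion. The paper's actual list is not reproduced here.
CRITERIA: Dict[str, str] = {
    "horizontal_line": r"<hr\b[^>]*>",
    "heading": r"<h[1-6]\b[^>]*>",
    "paragraph": r"<p\b[^>]*>",
    "table": r"<table\b[^>]*>",
}

TAG_RE = re.compile(r"<[^>]+>")


def split_on(html: str, delimiter_pattern: str) -> List[str]:
    """Split the page at every delimiter match and return non-empty text blocks."""
    pieces = re.split(delimiter_pattern, html, flags=re.IGNORECASE)
    # Strip the remaining tags and collapse whitespace inside each piece.
    blocks = [" ".join(TAG_RE.sub(" ", piece).split()) for piece in pieces]
    return [block for block in blocks if block]


def candidate_segmentations(html: str) -> Dict[str, List[str]]:
    """Generate one segmentation solution per criterion."""
    return {name: split_on(html, pattern) for name, pattern in CRITERIA.items()}


def best_segmentation(
    html: str, evaluate: Callable[[List[str]], float]
) -> Tuple[str, List[str]]:
    """Return the criterion name and blocks of the highest-scoring solution."""
    candidates = candidate_segmentations(html)
    best = max(candidates, key=lambda name: evaluate(candidates[name]))
    return best, candidates[best]
```

Any scoring function that takes a list of blocks and returns a number can be passed as `evaluate`; the selected blocks would then feed the block index.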
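$$
Dist(b_k, b_{k+1}) = \frac{1}{Sim(V_{b_k}, V_{b_{k+1}})} = \frac{1}{\cos(V_{b_k}, V_{b_{k+1}})} = \frac{\sqrt{\sum_{l=1}^{n} w_{l,b_k}^{2}} \times \sqrt{\sum_{l=1}^{n} w_{l,b_{k+1}}^{2}}}{\sum_{l=1}^{n} w_{l,b_k} \times w_{l,b_{k+1}}}
$$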
Where Vbk and Vbk+1 are the block vectors of bk and bk+1 respectively. The weight of each term is calculated using the Okapi BM25 measure.
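As an illustration of the distance between adjacent blocks (the inverse of the cosine similarity of their term-weight vectors), here is a minimal Python sketch. It assumes block vectors are stored as sparse dictionaries mapping a term to its weight; the function names are hypothetical, and returning infinity when the two blocks share no term is an assumption, since the source does not say how a zero similarity is handled.

```python
import math
from typing import Dict


def cosine_similarity(v1: Dict[str, float], v2: Dict[str, float]) -> float:
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(weight * v2.get(term, 0.0) for term, weight in v1.items())
    norm1 = math.sqrt(sum(weight * weight for weight in v1.values()))
    norm2 = math.sqrt(sum(weight * weight for weight in v2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    return dot / (norm1 * norm2)


def dist(v1: Dict[str, float], v2: Dict[str, float]) -> float:
    """Distance between two blocks as the inverse of their cosine similarity."""
    sim = cosine_similarity(v1, v2)
    # Assumption: blocks with no shared term are treated as infinitely distant.
    return math.inf if sim == 0.0 else 1.0 / sim
```

With non-negative weights the similarity lies between 0 and 1, so this distance is at least 1 and grows as the vocabulary of the two blocks diverges.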
5. EMPIRICAL EVALUATIONS
Our experiments are based on the Web track of TREC 2001. We used the Okapi BM25 measure in our ranking function. We compare two categories of algorithms, DocRank and BlockRank. DocRank(P) is the BM25 score of the page P, and BlockRank(P) is the highest BM25 score among the blocks of P. Table 1 shows that BlockRank outperforms DocRank on MAP, P@5 and P@10 on the TREC collection: on WT10g, it achieves improvements of 58%, 75% and 57% over DocRank on MAP, P@5 and P@10 respectively.
Table 1. MAP, P@5 and P@10 comparison
             MAP (Mean Average Precision)   P@5     P@10
DocRank      0.133                          0.18    0.172
BlockRank    0.2112                         0.316   0.27
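The difference between the two ranking functions can be sketched in Python as follows. This is an illustrative sketch only: the BM25 form below is one common variant, the parameter values `k1 = 1.2` and `b = 0.75` are conventional defaults rather than values taken from the paper, the `idf` table and average lengths are assumed to be precomputed by the caller, and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence


def bm25(query: Sequence[str], tokens: Sequence[str], idf: Dict[str, float],
         avg_len: float, k1: float = 1.2, b: float = 0.75) -> float:
    """BM25 score of a token sequence for a query (one common variant)."""
    tf = Counter(tokens)
    length_norm = 1.0 - b + b * len(tokens) / avg_len
    score = 0.0
    for term in set(query):
        freq = tf[term]
        if freq:
            score += idf.get(term, 0.0) * freq * (k1 + 1.0) / (freq + k1 * length_norm)
    return score


def doc_rank(query: Sequence[str], blocks: List[List[str]],
             idf: Dict[str, float], avg_page_len: float) -> float:
    """DocRank(P): BM25 score of the whole page (all blocks concatenated)."""
    page = [token for block in blocks for token in block]
    return bm25(query, page, idf, avg_page_len)


def block_rank(query: Sequence[str], blocks: List[List[str]],
               idf: Dict[str, float], avg_block_len: float) -> float:
    """BlockRank(P): highest BM25 score among the blocks of the page."""
    return max((bm25(query, block, idf, avg_block_len) for block in blocks),
               default=0.0)
```

On a page where the query terms are concentrated in one short block, `block_rank` is not diluted by the length of the unrelated blocks, which is the intuition behind scoring segments instead of whole pages.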
4. SEGMENTATION EVALUATION
The segmentation evaluation function combines two measures: the content coherence of each block and the distance between adjacent blocks. The content coherence measure is applied inside a segment and depends on the co-occurrence of the terms that belong to that segment. It reflects the density of the information linked to one topic and the degree of correlation between the terms of the block. The second measure is based on a similarity measure between two segment vectors. The distance between adjacent blocks allows us to locate the boundaries between dissimilar neighbouring blocks. The evaluation function is defined as follows:
$$
SegmEvalFunct(S_j, P) = \frac{\sum_{1 \le i \le nb(P,S_j)} Coh(b_i)}{nb(P,S_j)} \times \frac{\sum_{1 \le k \le nb(P,S_j)-1} Dist(b_k, b_{k+1})}{nb(P,S_j) - 1}
$$
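Read as the product of the mean block coherence and the mean distance between adjacent blocks, the evaluation function can be sketched in Python as below. The function name and the list-based interface are hypothetical, and raising an error for a segmentation with fewer than two blocks is an assumption: the formula divides by the number of blocks minus one, and the source does not state how a single-block page is scored.

```python
from typing import Sequence


def segm_eval(coherences: Sequence[float], distances: Sequence[float]) -> float:
    """Mean coherence of the blocks times mean distance between adjacent blocks.

    coherences[i] is Coh(b_i) for each block of the segmentation;
    distances[k] is Dist(b_k, b_{k+1}) for each pair of adjacent blocks.
    """
    nb = len(coherences)
    if nb < 2:
        # Assumption: the score is left undefined for fewer than two blocks.
        raise ValueError("at least two blocks are required")
    if len(distances) != nb - 1:
        raise ValueError("expected one distance per pair of adjacent blocks")
    mean_coherence = sum(coherences) / nb
    mean_distance = sum(distances) / (nb - 1)
    return mean_coherence * mean_distance
```

For example, choosing between two candidate solutions amounts to taking the one with the larger score, as in `max(solutions, key=lambda s: segm_eval(s.coherences, s.distances))` for some hypothetical `solutions` container.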
6. CONCLUSION
In this paper, we proposed a topic segmentation method that extracts semantic blocks from Web pages using visual criteria and content presentation HTML tags. The topic segmentation algorithm partitions Web pages into coherent segment units that correspond to a sequence of subtopical passages. We performed experimental evaluations of our algorithm using the information retrieval test collection of TREC 9. We found that our web page topic segmentation algorithm improves information retrieval by indexing documents more precisely and by subdividing texts into thematically coherent segments. Our topic segmentation method gives a better estimate of relevance with respect to the query.
Where Sj is a segmentation solution of the page P based on the visual criterion j, and nb(P,Sj) is the number of blocks extracted from P according to the segmentation solution Sj. The best segmentation solution is the one with the greatest value of the function. Coh(b) is the coherence inside the block b, which is calculated as follows:
$$
Coh(b) = \frac{1}{nt(b)^{2}} \sum_{t_i \in b} \sum_{t_j \in b} Cooccurrence(t_i, t_j)
$$
7. REFERENCES
[1] Buyukkokten, O., Garcia-Molina, H., and Paepcke, A., Accordion Summary for End-Game Browsing on PDAs and Cellular Phones, Proc. of Conference on Human Factors in Computer Systems, 2001.
[2] Cai, D., Yu, S., Wen, J.-R., and Ma, W.-Y., Extracting Content Structure for Web Pages Based on Visual Representation, Proc. of 5th Asia Pacific Web Conference, 2003.
[3] Chen, J., Zhou, B., Shi, J., Zhang, H., and Wu, Q., Function-Based Object Model towards Website Adaptation, Proc. 10th International World Wide Web Conference, 2001.
[4] Diao, Y., Lu, H., Chen, S., and Tian, Z., Toward Learning Based Web Query Processing, Proc. of International Conference on Very Large Databases, pp. 317-328, 2000.
[5] Jie, Z.D.L., and George, R.T., Combining DOM Tree and Geometric Layout Analysis for Online Medical Journal Article Segmentation, JCDL'06, June 11-15, 2006, Chapel Hill, North Carolina, USA.
[6] Lin, S.-H., and Ho, J.-M., Discovering Informative Content Blocks from Web Documents, Proc. of ACM SIGKDD, 2002.
with, for each pair of terms $t_i \in b$, $t_j \in b$ of the double sum,
$$
Cooccurrence(t_i, t_j) = \frac{Nbdoc(t_i, t_j)}{Nbdoc(t_i) + Nbdoc(t_j) - Nbdoc(t_i, t_j)}
$$
Where Nbdoc(t1,..,tn) is the number of documents containing all the terms t1,..,tn, and nt(b) is the number of terms in block b. Dist(bk,bk+1) is a distance measure between adjacent blocks, defined as the inverse of the cosine similarity between their block vectors.
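The coherence measure Coh(b) and its co-occurrence ratio can be sketched in Python as follows. This is a sketch under several assumptions: document frequencies come from a hypothetical `postings` dictionary that maps each term to the set of identifiers of the documents containing it, nt(b) is taken to count distinct terms, the double sum runs over all ordered pairs including a term paired with itself, and the normalization by nt(b) squared follows one reading of the extraction-damaged formula.

```python
from itertools import product
from typing import Dict, Iterable, Set


def cooccurrence(ti: str, tj: str, postings: Dict[str, Set[int]]) -> float:
    """Nbdoc(ti, tj) / (Nbdoc(ti) + Nbdoc(tj) - Nbdoc(ti, tj))."""
    docs_i = postings.get(ti, set())
    docs_j = postings.get(tj, set())
    both = len(docs_i & docs_j)
    either = len(docs_i) + len(docs_j) - both
    return both / either if either else 0.0


def coherence(block_terms: Iterable[str], postings: Dict[str, Set[int]]) -> float:
    """Average co-occurrence over all ordered pairs of distinct block terms."""
    terms = list(dict.fromkeys(block_terms))  # distinct terms, order preserved
    nt = len(terms)
    if nt == 0:
        return 0.0
    total = sum(cooccurrence(ti, tj, postings)
                for ti, tj in product(terms, repeat=2))
    return total / (nt * nt)
```

The co-occurrence ratio is the Jaccard coefficient of the two terms' document sets, so a block whose terms tend to appear in the same documents scores close to 1, while a block mixing unrelated vocabulary scores lower.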