Building Implicit Links from Content for Forum Search

Gu Xu, Microsoft Research Asia, 5/F, Sigma Center, 49 Zhichun Road, Beijing (100080), China, guxu@microsoft.com
Wei-Ying Ma, Microsoft Research Asia, 5/F, Sigma Center, 49 Zhichun Road, Beijing (100080), China, wyma@microsoft.com

ABSTRACT
The objective of Web forums is to create a shared space for open communication and discussion of specific topics and issues. The tremendous amount of information behind forum sites is not yet fully utilized. Most links between forum pages are automatically created, which means link-based ranking algorithms cannot be applied effectively. In this paper, we propose a novel ranking algorithm that introduces content information into link-based methods as implicit links. The basic idea is derived from a more focused random surfer: the surfer is more likely to jump to a page that is similar to the one he is currently reading. In this manner, content similarities can be introduced into the link graph as a personalization bias. Our method, named Fine-grained Rank (FGRank), can be computed efficiently based on an automatically generated topic hierarchy. Unlike topic-sensitive PageRank, our method only needs to compute a single PageRank score for each page. Another contribution of this paper is a very efficient algorithm for automatically generating a topic hierarchy and mapping each page in a large-scale collection onto the computed hierarchy. The experimental results show that the proposed method improves retrieval performance, and reveal that the content-based link graph is no less important than the hyperlink graph.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval; H.5.4 [Information Interfaces and Presentation]: Hypertext/Hypermedia

General Terms
Algorithms, Performance, Experimentation

Keywords
Forum Search, PageRank, Hierarchy Generation, Clustering, Categorization

1. INTRODUCTION
A Web forum, also called a bulletin or message board, is an online portal for open discussions on specific topics and issues. It can also be considered the upgraded Web version of traditional newsgroups. Unlike common Web sites that are managed by editors, innumerable new postings are created each day on Web forums, covering almost any conceivable topic. In effect, Web forums have created a kind of knowledge base built by millions of Internet users over many years. Commercial search engines, e.g. Google [1] and Yahoo! [2], have begun to leverage a small part of the valuable information gleaned from such forums. An interesting observation is that, if one issues a query on Google about a specific problem such as "my camera got wet", most of the answers come from forum pages. This tremendous amount of information is currently hidden behind a large number of forum sites distributed across the Internet. How to make it accessible to all Internet users has become a valuable research topic.
Xi et al. [26] proposed a learning-based approach to optimize the ranking function for newsgroup search. However, Web forums, though derived from newsgroups, have distinct characteristics that make the search problem different. In newsgroups, each message is sent as an individual email with semi-structured headers. Web forums, in contrast, are composed of HTML pages; they are well formatted for human readability, but not for machine parsing. Multiple postings, user profiles, and artificial links are often integrated into one page. Because of the variety of forums, it is a challenging task to automatically extract structured information from these pages, such as splitting different postings or merging postings into a thread tree. Unlike newsgroups, however, link information is available in Web forums. The most useful links are "reference" links in user postings, e.g. a link preceded by "see my previous posting". Moreover, important postings, such as "sticky" messages and popular threads, can always be reached from any other page within several clicks. Even at the site level, there are still some cross-forum links which give strong "recommendations". Therefore, in this paper we try to utilize link-based ranking methods to improve search results in Web forums.

Link-based ranking techniques have been proven effective in modern Web search engines. Commonly, link-based ranking algorithms treat a hyperlink from one page to another as some kind of recommendation. But this assumption is not always correct. Navigational links (most in-site links) and advertisement links are common on the Internet, and even two hyperlinked pages are often irrelevant [5]. In Web forums, the link information is even noisier. Most links in forums are automatically generated and serve only navigation and operations (e.g. replying to a thread or sorting threads by date), rather than recommendation. On the other hand, the useful links, including cross-forum links, are relatively rare. Therefore, relying completely on these noisy links may not be a good solution. In this paper, instead of using hyperlinks only, we build an implicit graph from the content and combine it with the original link graph to improve search performance. As a result, we propose a novel ranking algorithm named Fine-grained Rank (FGRank), which is derived from PageRank.

PageRank assumes that the random surfer has no preference when choosing a "reset" page, i.e. he jumps to any page with equal probability. Some later work on PageRank [12, 14] assumes that the random surfer has a fixed preference when taking a jump, regardless of his current interest context. In reality, surfers' activities are often task-based and topic-specific. That is, the surfer is more likely to jump to a page that is similar to the one he is currently reading. To model these dynamic preferences, we assume that any surfer's interests can be represented by the nodes of a topic hierarchy. Most likely, a surfer who is focused on a specific topic will continue to stay on this topic or drill down to a sub-topic. In this manner, the surfer has different preferences on each page, determined by the topic of that page. The proposed approach can also be interpreted as a multi-level smoothing of the link graph. Another contribution of this paper is a very efficient algorithm for automatically generating a topic hierarchy and mapping each page in a large-scale collection onto the computed hierarchy.
The basic idea derives from the fact that text categorization can be conducted efficiently based on only a few discriminative words.

The rest of this paper is organized as follows. In Section 2, we review link-based ranking algorithms and introduce the background of our formulation. We then present the FGRank algorithm, including automatic hierarchy generation and page categorization, in Section 3. The experimental results are given in Section 4. Finally, Section 5 concludes the paper and discusses future work.

2. LINK-BASED RANKING ALGORITHMS
Compared with traditional document retrieval, link-based ranking, such as HITS [16] and PageRank [7, 21], is a distinctive technique in modern Web search systems. Google's PageRank algorithm is widely applied in both academia and industry. It assigns each Web page an "importance" score based on the stationary distribution of a random walk on the directed hyperlink graph. PageRank can also be intuitively interpreted as simulating the actions of a random surfer: the surfer starts from a random page and keeps clicking on the links of the current page with uniform probability, but eventually gets bored and starts again on another random page. More specifically, let P be the stochastic transition matrix of the link graph without sink nodes [21], where P(i, j) = 1/deg(i) if there is a link from the ith page to the jth page, and deg(i) is the out-degree of the ith page. The PageRank vector v_{PR} is then defined as:

    v_{PR} = c P^T v_{PR} + (1 - c) E^T v_{PR},    E = [1]_{n \times 1} \times v_{bias}^T    (1)

where c is the damping factor, usually set to 0.85, n is the number of pages in the link graph, E is the teleportation matrix containing the artificial jumps, and v_{bias} is the personalization vector, often uniformly set to 1/n. A non-uniform v_{bias} results in a biased PageRank vector with preferences for certain kinds of pages. The PageRank vector can be computed by the iterative power method, starting from uniform initial ranks.

Based on this basic formulation, most extensions can be classified into three categories, corresponding to different components of PageRank. The damping factor c is not only a smoothing factor of the PageRank vector, but also a useful cue for link-spam detection [6]. Instead of using a constant damping factor, Dirichlet PageRank [25] employs dynamic damping factors based on Dirichlet smoothing to improve retrieval performance; the underlying assumption is that the random surfer is more likely to follow the out-links of the current page if the page has many out-links. The personalization vector v_{bias} can bias PageRank scores according to user preferences. Topic-sensitive PageRank [12] uses 16 different personalization vectors, built from the top-level topics of the ODP data, to compute 16 PageRank scores for each page; the final scores are computed at query time as linear combinations of the precomputed scores. This method cannot scale up, because a fine-grained topic basis requires more personalization vectors and therefore more computation. Jeh and Widom [14] later improved it by using hub vectors to construct biased PageRank scores at query time; their algorithm scales well because its cost depends only on the number of non-zero elements in the personalization vector. Our proposed FGRank also utilizes modified personalization vectors to enhance ranking performance. But unlike topic-sensitive PageRank, our method only needs to compute one PageRank score for each page, without any overhead at query time. Our algorithm is efficient in both computation and storage and also fits our task well.
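To make the formulation in Eq. (1) concrete, the following is a minimal sketch (not from the paper) of computing a biased PageRank vector by power iteration. The toy link graph, the damping factor, and the bias vectors are illustrative assumptions.

```python
import numpy as np

def biased_pagerank(P, v_bias, c=0.85, tol=1e-10, max_iter=200):
    """Power iteration for Eq. (1): v = c * P^T v + (1 - c) * E^T v,
    where every row of E equals v_bias, so E^T v = v_bias * sum(v)."""
    n = P.shape[0]
    v = np.full(n, 1.0 / n)                  # uniform initial ranks
    for _ in range(max_iter):
        v_new = c * P.T @ v + (1.0 - c) * v_bias * v.sum()
        if np.abs(v_new - v).sum() < tol:
            break
        v = v_new
    return v

# Toy 3-page link graph (row-stochastic, no sink nodes) -- illustrative only.
P = np.array([[0.0, 0.5, 0.5],
              [1.0, 0.0, 0.0],
              [0.5, 0.5, 0.0]])
uniform = np.full(3, 1.0 / 3)                # classic PageRank bias
biased  = np.array([0.7, 0.2, 0.1])          # non-uniform personalization
print(biased_pagerank(P, uniform))
print(biased_pagerank(P, biased))
```

With the uniform bias this reduces to ordinary PageRank; with the non-uniform bias, mass is pulled toward the preferred pages, which is the effect FGRank exploits at the topic level.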
BlockRank [15] uses the link-structure hierarchy, e.g. domains, hosts, and pages, to build the personalization vectors; the drawback of this method is that it does not consider the content of the pages. In traditional PageRank, all links are treated equally. As mentioned above, in-site links differ from off-site ones in terms of link properties. Based on this observation, Xue et al. [27] and Kamvar et al. [15] leverage the hierarchical structure of the Web graph and employ modified stochastic transition matrices to compute the PageRank scores. Again, the page content is ignored. Richardson and Domingos [22] proposed the Directed Surfer Model, which weights links based on the page relevance to given queries; the direct interpretation of the model is a more "intelligent" surfer who can selectively click out-links according to his information need. The Directed Surfer Model needs to pre-compute different PageRank scores for each term. Focused Rank [10] is another work that combines content information with links; it also does not scale well because it needs to compute PageRank scores at query time. These methods are similar in spirit to ours, but our method can be computed efficiently. In the next section, we introduce the proposed algorithm in detail and show its connections with existing methods.

3. FINE-GRAINED RANK ALGORITHM
Link information is not always reliable, especially in the forum context. Combining content with links is a promising direction for improving link-based ranking algorithms. However, most methods in the literature are not practical and do not scale well. The proposed FGRank, which is based on non-uniform personalization vectors built from content similarities, is efficient and incurs no overhead at query time.

3.1 Topic Hierarchy Bias
A user's information need is often task-oriented and topic-specific. It is therefore reasonable to assume that the random surfer is more likely to "teleport" to pages that are relevant to the current page. Intuitively, we would need to compute a distinct personalization vector for each page, based on its content similarities to all other pages. But this straightforward solution is impractical in both computation and storage: we would need to compute a full similarity matrix and would heavily increase the cost of each PageRank iteration step. To tackle this problem, we conduct the calculation at a higher level, i.e. the topic level, rather than the page level. Because the number of topics is much smaller than the number of pages, we can greatly reduce the computation and storage complexity in this manner.

Topics can naturally be organized in a tree-like structure, namely a topic hierarchy, which allows us to compute the bias vector efficiently based on the hierarchical structure. Given two nodes N_i and N_j on the topic hierarchy tree, the path from N_i to N_j is written as N_{p(1)}, N_{p(2)}, ..., N_{p(m)}, where N_{p(1)} = N_i and N_{p(m)} = N_j. The potential (we say "potential" rather than "probability" because the computed values are not normalized to one) of jumping from N_i to N_j is computed by:

    P(N_i, N_j) = \prod_{k=1}^{m-1} P(N_{p(k)}, N_{p(k+1)})   if m > 2
                = Pr(N_j | N_i)                                if m = 2, N_i = Pa(N_j)
                = z                                            if m = 2, N_j = Pa(N_i)
                = 1                                            if m = 1                    (2)

where Pa(N_i) is the direct parent node of N_i, Pr(N_j | N_i) denotes the conditional probability of N_j given N_i (computed by equation (5)), and z, 0 < z < 1, is called the "shift" factor. It controls the probability that the random surfer shifts his interest from the current topic to another topic; a smaller z leads to a more focused surfer. The topic hierarchy bias can be intuitively explained as follows: when a surfer is reading a page under a certain topic (e.g. food), he is more likely to jump to another page on the same topic or a sub-topic (e.g. cake), but less likely to teleport to pages about broader topics (e.g. living) or distinct topics (e.g. sports). A specific example of the topic hierarchy bias is illustrated in Fig. 1.

Figure 1: Computing the topic hierarchy biases.

Assume that each page is assigned to a single node of the hierarchy tree, and let N(i) be the topic node of the ith page. The jth element of the ith page's personalization vector, i.e. the probability of teleporting from the ith page to the jth page, is then calculated by:

    v_{bias}(i, j) = P(N(i), N(j)) / \sum_{k=1}^{n} P(N(i), N(k))    (3)

In the forum context, there is another intuitive explanation of the hierarchy bias. Assume that we have built a meta-forum site by aggregating the distributed forum sites, and mapped all the pages onto a topic hierarchy, or directory. The surfer then has two choices: browse the information at the original sites (link information), or at our new site (content similarity information). Our method combines the two random walks, on graph P and on graph E in (1), respectively.

3.1.1 Comparison with existing methods
Topic-sensitive PageRank [12] computes PageRank scores separately for each topic, and each topic has its own personalization vector. Our FGRank algorithm also uses different personalization vectors for distinct topics, but only needs to compute PageRank once. Consider two pages P_i and P_j, where P_i belongs to the query topic and P_j does not: ideally, topic-sensitive PageRank gives P_i a higher score than P_j. Our method is query independent: if P_i and P_j are highly referenced within their own topics, both obtain high ranking scores. However, after combining with the relevance scores, P_i and P_j can still be ranked correctly. If a given query is relevant to several topics, FGRank is prone to return pages from popular topics. A known issue of PageRank is that it suppresses newly created pages; similarly, our method may suppress pages in less popular topics, for example emerging topics. Richardson and Domingos [22] and Diligenti et al. [10] employ content information to re-weight links according to queries. The ideas are promising, but the algorithms are computationally intensive. Our FGRank boosts links pointing to pages within the same topic and suppresses cross-topic links. At the same time, our method introduces strong implicit links between the pages of one topic, which may cause a smoothing effect on the scores.
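To illustrate the bias construction of Section 3.1, here is a small sketch (not from the paper) that computes the jump potential of Eq. (2) over a toy hierarchy and normalizes it into personalization weights as in Eq. (3). The hierarchy, the conditional probabilities, the shift factor, and the page labels are all illustrative assumptions.

```python
# Toy topic hierarchy: parent pointers and Pr(child | parent) as in Eq. (5).
parent = {"living": None, "food": "living", "cake": "food", "sports": "living"}
pr_child = {("living", "food"): 0.6, ("living", "sports"): 0.4,
            ("food", "cake"): 1.0}
z = 0.2                                    # "shift" factor (assumed value)

def path(a, b):
    """Node path from a to b through their lowest common ancestor."""
    up_a, x = [], a
    while x is not None:
        up_a.append(x)
        x = parent[x]
    up_b, x = [], b
    while x not in up_a:
        up_b.append(x)
        x = parent[x]
    return up_a[:up_a.index(x) + 1] + list(reversed(up_b))

def edge_potential(a, b):
    """Single step: Pr(b|a) when drilling down, z when moving up."""
    return pr_child[(a, b)] if parent.get(b) == a else z

def potential(ni, nj):
    """Eq. (2): product of edge potentials along the path; 1 if ni == nj."""
    p = path(ni, nj)
    out = 1.0
    for a, b in zip(p, p[1:]):
        out *= edge_potential(a, b)
    return out

# Eq. (3): personalization weights for a page whose topic is "cake",
# over pages labelled with the topics below (illustrative page labels).
page_topics = ["cake", "food", "sports", "living", "cake"]
raw = [potential("cake", t) for t in page_topics]
v_bias = [r / sum(raw) for r in raw]
print(v_bias)
```

Pages on the same topic ("cake") receive the largest teleportation weight, the parent topic ("food") a smaller one, and distant topics ("sports") the smallest, which is exactly the focused-surfer behaviour the bias is meant to encode.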
3.2 Topic Hierarchy Generation
To compute the topic hierarchy bias, we need a topic hierarchy, and each page must be assigned to a topic node. The direct way is to conduct hierarchical clustering [13, 24] or categorization [11, 17, 4, 9] on the whole collection. Clustering-based methods are often hard to scale up, especially hierarchical clustering algorithms, while categorization-based methods need a pre-defined hierarchy and training data. It is therefore hardly possible to apply these methods directly to our problem. However, user discussions in forum sites are mostly topic-centric, so the diversity of topics in our collection is naturally confined. This means that we can use only a small number of pages, sampled from the whole collection, to generate the topic hierarchy, and then categorize the remaining pages based on the computed hierarchy. A similar idea can also be found in [19].

The widely applied vector-space model represents documents as high-dimensional, sparse vectors. Feature selection is helpful for some text categorization methods; in many cases, even a small set of discriminative words suffices to categorize documents efficiently [17]. With this observation, we look for a clustering algorithm that not only generates document clusters but also gives associated word clusters, where the word clusters are the most discriminative for each cluster and can be used efficiently for later categorization. Supervised methods employ similar approaches to enhance feature selection [4, 9].

Consider a document-by-word matrix A such that A_{i,j} is the frequency of the jth word in the ith document. Our purpose is essentially to transform the matrix A into a block-diagonal matrix, as shown in Fig. 2. This clustering method is called co-clustering and is derived from the problem of bipartite graph partitioning [8]. For completeness, we include the algorithm (with our small modifications) in this paper, shown in Fig. 3. The co-clustering algorithm simultaneously groups the documents and words into clusters; each cluster contains both documents and associated words. But it is still a flat clustering algorithm. In this paper, we extend it to generate hierarchical clusters. Basically, we can simply apply the co-clustering algorithm recursively to obtain the topic hierarchy. However, to locate the most discriminative word clusters for efficient categorization, we need to continuously refine the word clusters after each round of co-clustering. Assume that the mapping function C(x) gives the cluster label of x, where x can be either a document or a word. We define the discriminative power to judge the quality of each word as:

    Power(w_i) = |{ j : C(d_j) = C(w_i), A_{j,i} > 0 }| / |{ j : A_{j,i} > 0 }|    (4)

where w_i and d_j denote the ith word and the jth document in matrix A, and |S| is the cardinality of set S. Power(w) gives the probability that a document belongs to cluster C(w) if it contains the word w. To maximize the dissimilarity between clusters, we filter out the ambiguous words in each cluster whose discriminative power is lower than a given threshold. This refinement step can be considered the feature (i.e. word) selection widely used in text categorization. For our purpose, we care more about speed than accuracy; we use formula (4) for efficiency, but other feature selection techniques [28] could also be used. A complete description of the hierarchical co-clustering algorithm is given in Fig. 3.

The conditional probability Pr(N_j | N_i), N_i = Pa(N_j), introduced in Section 3.1, is estimated in this step. Without loss of generality, we may assume that N_i is the root node of the topic hierarchy. The maximum likelihood estimate of Pr(N_j | N_i) is then:

    Pr(N_j | N_i) = |{ j : C(d_j) = N_j, d_j \in D }| / |D|    (5)

where D is the set of documents and |D| is the total number of documents.

Figure 2: Finding document clusters and associated discriminative words by transforming the document-by-word matrix into a block-diagonal matrix.

Hierarchical Co-clustering for Hierarchy Generation
1. Given the m x n document-by-word matrix A, remove stop words (columns) from A by checking DF values. Then perform the SVD decomposition:
       D_1 = diag(A [1]_{n \times 1}),  D_2 = diag([1]_{1 \times m} A),
       A_n = D_1^{-1/2} A D_2^{-1/2},   A_n = U \Sigma V^T  (truncated SVD)
2. Let the m largest singular values be \sigma_1 >= \sigma_2 >= ... >= \sigma_m. Determine the cluster number k by the eigengap: k = arg max_{m >= i > 1} (\sigma_{i-1} - \sigma_i) / \sigma_{i-1}.
3. Find the corresponding k singular vectors of A_n, u_1, ..., u_k and v_1, ..., v_k, and form the matrix Z by:
       Z = [ D_1^{-1/2} [u_1, ..., u_k] ; D_2^{-1/2} [v_1, ..., v_k] ]
   (We found in experiments that using more singular vectors may sometimes lead to better performance; this observation has also been reported before.)
4. Normalize each row of Z to unit length [20]. Run k-means to cluster the m + n rows of Z into k clusters. The first m labels are the documents'; the remaining n labels are the words'.
5. For each cluster, compute the discriminative power of its words by (4). If the number of documents in the cluster is smaller than a threshold, stop; otherwise, go to Step 6.
6. Form a sub-matrix induced from A by selecting all the documents in the cluster and a portion of the discriminative words from the same cluster, chosen by their discriminative power (i.e. the less discriminative words are removed). Then recursively run the algorithm on the formed sub-matrix.

Figure 3: The hierarchical co-clustering algorithm for automatically generating the topic hierarchy.
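As a concrete illustration of Steps 1-4 of the algorithm in Fig. 3, here is a minimal sketch (not the authors' implementation) of one round of spectral co-clustering with numpy and scikit-learn, together with the discriminative power of Eq. (4). The toy matrix is an assumption, a dense SVD is used only because the example is tiny, and the eigengap convention is one possible reading of Step 2.

```python
import numpy as np
from sklearn.cluster import KMeans

def co_cluster(A, k=None, random_state=0):
    """One round of the spectral co-clustering in Fig. 3 (Steps 1-4)."""
    m, n = A.shape
    d1 = A.sum(axis=1)                      # document degrees (row sums)
    d2 = A.sum(axis=0)                      # word degrees (column sums)
    An = A / np.sqrt(np.outer(d1, d2))      # D1^{-1/2} A D2^{-1/2}
    U, s, Vt = np.linalg.svd(An, full_matrices=False)
    if k is None:                           # Step 2: eigengap heuristic
        gaps = (s[:-1] - s[1:]) / s[:-1]    # relative drop sigma_{i-1} -> sigma_i
        k = int(np.argmax(gaps) + 2)        # one reading of arg max_{i>1}
    Z = np.vstack([U[:, :k] / np.sqrt(d1)[:, None],
                   Vt[:k].T / np.sqrt(d2)[:, None]])
    Z /= np.linalg.norm(Z, axis=1, keepdims=True)   # Step 4: unit-length rows
    labels = KMeans(n_clusters=k, n_init=10,
                    random_state=random_state).fit_predict(Z)
    return labels[:m], labels[m:], k        # document labels, word labels

def power(doc_labels, word_labels, A):
    """Discriminative power of each word, Eq. (4)."""
    present = A > 0
    agree = (doc_labels[:, None] == word_labels[None, :]) & present
    return agree.sum(axis=0) / np.maximum(present.sum(axis=0), 1)

# Tiny toy corpus with two obvious document/word blocks (illustrative only).
A = np.array([[3, 2, 0, 0],
              [4, 1, 0, 1],
              [0, 0, 5, 2],
              [1, 0, 3, 4]], dtype=float)
docs, words, k = co_cluster(A, k=2)
print(k, docs, words, power(docs, words, A))
```

Recursing on the sub-matrix induced by each cluster, after dropping low-power words, would give the hierarchical extension described in Steps 5 and 6.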
3.3 Hierarchical Categorization
When the topic hierarchy has been generated, we also obtain a set of discriminative words for each node. We can then build efficient classifiers based on these words, without any training process. Given an unlabeled document d, we start from the root node and recursively categorize the document in a decision-tree-like manner. Without loss of generality, let N be the root node and Ch(N) be the set of its children. The likelihood of document d for a node N_i \in Ch(N) is computed by summing the evidence from the words:

    L(d, N_i) = \sum_{w \in DW(N_i)} Power(w | N_i) \cdot Exist(d, w)    (6)

where DW(N) is the set of discriminative words associated with node N, Power(w | N) is the discriminative power of word w given node N, and Exist(d, w) is a binary indicator that equals 1 when word w occurs in document d and 0 otherwise.

Suppose document d obtains its two largest likelihoods on nodes N_p and N_q in Ch(N), with L(d, N_p) >= L(d, N_q) >= L(d, N_i), i != p, q. We define the topic purity of document d on node N by:

    Purity(d, N) = (L(d, N_p) - L(d, N_q)) / L(d, N_q)    (7)

When the topic purity is close to zero, there is no need to continue classifying document d at the lower level, because it is ambiguous between topics N_p and N_q. In particular, if the purity value is smaller than a threshold, we label document d with node N; otherwise we choose node N_p and continue to classify d at the next level. A specific example illustrating the categorization process is given in Fig. 4.

Figure 4: Categorizing a page on the computed topic hierarchy. W_i denotes a word, and its discriminative power is given below the word.

Additionally, the proposed categorization method can be applied efficiently in the context of indexing. Our approach depends only on the presence of discriminative words, which is already available in the phase of building the inverted index. For each document, we can first calculate the document likelihoods for all topic nodes during indexing, and then categorize the document later. The overhead is therefore very limited.
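The recursive descent of Eqs. (6) and (7) can be sketched as follows. This is illustrative code, not the authors'; the hierarchy, the discriminative-word lists with their powers, and the purity threshold are toy assumptions.

```python
# Toy hierarchy: each node has children and discriminative words with
# assumed Power(w | node) values.
children = {"root": ["computer", "recreation"],
            "computer": ["graphics", "hardware"],
            "recreation": ["autos", "hockey"]}
disc_words = {"computer":   {"windows": 0.8, "cpu": 0.7, "driver": 0.5},
              "recreation": {"game": 0.7, "team": 0.6, "engine": 0.4},
              "graphics":   {"image": 0.9, "render": 0.8},
              "hardware":   {"cpu": 0.9, "motherboard": 0.8},
              "autos":      {"engine": 0.9, "dealer": 0.7},
              "hockey":     {"team": 0.9, "goal": 0.8}}
PURITY_THRESHOLD = 0.3

def likelihood(doc_words, node):
    """Eq. (6): sum of Power(w|node) over discriminative words present in d."""
    return sum(p for w, p in disc_words[node].items() if w in doc_words)

def categorize(doc_words, node="root"):
    """Decision-tree-like descent with the purity check of Eq. (7)."""
    kids = children.get(node, [])
    if not kids:
        return node                                   # reached a leaf
    ranked = sorted(kids, key=lambda n: likelihood(doc_words, n), reverse=True)
    l_p = likelihood(doc_words, ranked[0])
    l_q = likelihood(doc_words, ranked[1]) if len(ranked) > 1 else 0.0
    if l_p == 0.0:
        return node                                   # no evidence at all
    purity = (l_p - l_q) / l_q if l_q > 0 else float("inf")
    if purity < PURITY_THRESHOLD:
        return node                                   # too ambiguous: stop here
    return categorize(doc_words, ranked[0])           # drill down on N_p

print(categorize({"cpu", "motherboard", "windows"}))  # -> "hardware"
print(categorize({"game", "engine", "team"}))         # ambiguous, stays at "recreation"
```

The second query stops one level early because its two best children tie, which is exactly the behaviour the purity threshold is designed to produce.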
3.4 Computation Issues
As mentioned above, our method produces a single PageRank score for each page, so FGRank brings no overhead to the retrieval engine at query time. Although we employ non-uniform personalization vectors, the teleportation matrix E is not of full rank. In this section, we show how to compute FGRank efficiently.

Let H = {N_1, N_2, ..., N_k} be the generated topic hierarchy, where each node N_i represents a topic. Given the ith page d_i in the collection, C(d_i) is the topic label of d_i, and v_{bias}^i is the personalization vector of d_i, computed by (3). It is easy to prove that there are only k distinct personalization vectors: if two pages d_i and d_j satisfy C(d_i) = C(d_j) = N, they share the same personalization vector, denoted v_{bias}^N. Therefore, we can rewrite (1) for FGRank as:

    v_{PR} = c P^T v_{PR} + (1 - c) E^T v_{PR}
           = c P^T v_{PR} + (1 - c) [v_{bias}^1, v_{bias}^2, ..., v_{bias}^n] v_{PR}
           = c P^T v_{PR} + (1 - c) \sum_{N_j \in H} v_{bias}^{N_j} \sum_{C(d_i) = N_j} v_{PR}(i)    (8)

Because the vectors v_{bias}^{N_j} can be pre-computed, the overhead of FGRank is approximately equal to appending k non-sparse columns to the stochastic transition matrix P. In particular, to deal with sink nodes, we can modify (8) as:

    v_{PR} = c P^T v_{PR} + \sum_{N_j \in H} v_{bias}^{N_j} \sum_{C(d_i) = N_j} (1 - c s_i) v_{PR}(i)    (9)

where s_i is an indicator that equals 0 when d_i has no out-links and 1 otherwise. The detailed proof is given in the appendix.

4. EXPERIMENTS
The experiments are divided into two parts. First, we evaluate the algorithms for automatic topic hierarchy generation and hierarchical categorization. Then, we compare the proposed ranking algorithm with PageRank on real data gleaned from Web forums.

4.1 Automatic Hierarchical Categorization
Before applying our algorithms in the context of Web forums, we used the 20-Newsgroups (NG20) dataset [18] to evaluate them. The NG20 dataset consists of approximately 20,000 messages evenly collected from 20 different Usenet newsgroups. The pre-processing steps include removing stop words (based on a general stop-word list) and very-low-frequency noise words, and ignoring the message headers except the subject lines. We use 70% of the data to build a two-level topic hierarchy and then classify the remaining 30% based on the computed hierarchy. To facilitate the evaluation, we ignore the purity check and force each message to be assigned to a leaf node.

4.1.1 Hierarchical clustering

Figure 5: The results of two-level co-clustering.

Fig. 5 gives the generated hierarchy and the associated newsgroups in NG20. The association is calculated by finding the maximum overlap between cluster nodes and NG20 newsgroups. It is easy to see that the computed hierarchy matches the newsgroup topics very well: our method not only successfully distinguishes between "rec.sport.baseball" and "rec.sport.hockey", but also indicates their similarity by grouping them at the upper level. We also give the confusion matrix at the first level in Table 1. The messages in "sci.electronics" and "misc.forsale" are confused between two generated clusters. These results are understandable because the topic boundaries may not be very clear, e.g. "sci.electronics" may overlap with "comp.*.hardware".
Overall, we employ the standard categorization measures, macro-averaged and micro-averaged F1, to evaluate the performance of the clustering algorithm, as given in Table 2. Our results are promising and approach those of some supervised methods. Even at the second level, where 13 classes are considered, the F1 values are still higher than 0.76.

Table 1: Confusion matrix at the first level.

Table 2: The overall performance of hierarchical clustering and categorization
                                                 Macro-F1   Micro-F1
  Hierarchical Clustering      Level 1             0.842      0.846
                               Level 2             0.769      0.763
  Hierarchical Categorization  Level 1             0.828      0.842
                               Level 2             0.708      0.713
                               Level 2 (removed)   0.766      0.754

4.1.2 Hierarchical categorization
The categorization performance is also shown in Table 2. When only the first level of topics is considered, i.e. 6 categories, the categorization performance is quite good and approaches that of the clustering. The performance drops noticeably at the second level; even so, the F1 values are still higher than 0.7. If we ignore the two ambiguous newsgroups, "sci.electronics" and "misc.forsale", the F1 performance rises to around 0.76, shown as "Level 2 (removed)" in Table 2. Although our algorithm is not as good as state-of-the-art supervised methods, it can be derived directly from our clustering results and classifies pages on the fly; it strikes a good balance between speed and performance. Moreover, our methods are unsupervised and can be run fully automatically without human labels. On this collection, i.e. 14,000 messages for clustering and 5,997 messages for categorization, we need no more than 30 CPU seconds to generate the two-level hierarchy and no more than 10 CPU seconds for classification. All times were measured on a standard 2GHz Pentium PC running Windows XP.

4.2 FGRank Evaluation
The experimental data are collected from four Web forums, all of which focus on a specific topic, i.e. mobile devices. After the pages are crawled and stored locally, we conduct a pre-processing step to remove noise pages, e.g. user-profile, error, login, and reply pages. The cleaned collection contains more than one million pages. The average size of these forum pages is about 60.7 KB, which is much larger than common Web pages. We also found that the average number of links per page is 85.9, but only 24.6% of the links point to pages in our cleaned collection; that is, more than three quarters of the links lead to noise pages. The cleaning process is therefore a critical step in our context.

4.2.1 Topic hierarchy
The topic hierarchy is built from 60,000 pages randomly sampled from the whole collection.
The generated hierarchy contains 68 nodes spanning 5 levels. Fig. 6 shows the distributions of nodes and categorized pages on each level of the topic hierarchy. The threshold for topic purity is set to 0.3. If the classifier built from only 60,000 pages could not generalize to the whole collection, the categorization algorithm would stop early at the top levels for most pages. However, it is easy to see from Fig. 6 that a large number of pages are categorized to nodes at the deeper levels, i.e. 3 and 4. These statistics indicate that our strategy of combining clustering with categorization works well on the forum page collection.

Figure 6: Node and categorized-page distributions (node number and page number, in units of 10k) on each level (0-5) of the topic hierarchy.

4.2.2 Retrieval Performance
We first collect some frequently asked questions from these forum sites, and then refine the questions into query terms. We obtain 25 representative queries (e.g. "SIM card problem", "how delete ringtone") which cover different components of mobile devices, for example Bluetooth and SIM cards, as well as manufacturers. We first use BM2500 [23] to find the top 100 pages for each query, and then re-order the 100 pages according to PageRank and FGRank respectively. We keep the top 30 pages from each of the three ordered lists and manually judge the relevance of the pages to the queries. Overall, we obtain 1,744 pages with relevance judgments. We linearly combine the relevance score, BM2500, with the link-based importance scores, namely PageRank and FGRank. The parameters are tuned to achieve the best performance on the annotated queries. We take P@10 and Mean Average Precision (MAP) as the evaluation metrics, which are widely used in TREC; details can be found in [3].

The overall performance is given in Fig. 7. The discussions in forums are highly topic-specific, and important terms are frequently repeated across a page, which makes the relevance scores very effective for retrieval. Thus BM2500 achieves much higher performance here than in the general Web search task. Even so, FGRank still outperforms BM2500 and PageRank by 8.1% and 5.9% on MAP, and by 7.4% and 3.6% on P@10. We also conduct a t-test on the results; it shows that our improvements on MAP are significant (0.04 for PageRank and 0.03 for BM2500). Because most of the P@10 scores approach 1, the improvements on this measure failed to pass the t-test.

Figure 7: The performance comparison (MAP and P@10) between BM2500, PageRank and FGRank.

4.2.3 Combining factor for link and content
The damping factor can be considered a combining factor for link and content information in FGRank. We also explore how this factor impacts the performance: we use 6 different factor values and evaluate their performance, as shown in Fig. 8. The best combination is achieved when c is 0.3. It is clear that the implicit links built from content are more important than the explicit link information, but the link information is still useful: if we continue to increase the weight on content, the performance eventually drops.

Figure 8: The performance (MAP and P@10) as the damping factor varies.

5. CONCLUSION
Most links in forums are not "recommendations" but serve navigation and operations. In this paper, we proposed a link-based ranking algorithm, named Fine-grained Rank (FGRank), to overcome this problem. The basic idea is that the random surfer is more likely to jump to a page that is relevant to his current page. We build implicit links from content and combine them with hyperlinks through a non-uniform bias. Experimental results indicate that FGRank outperforms PageRank in our context. Our algorithms for topic hierarchy generation and hierarchical categorization are also shown to be efficient and effective. Although the algorithms proposed in this paper are motivated by the problems in forum search, the basic assumption is still applicable to general Web search. The work in [15] and [27] tries to improve link-based ranking by utilizing the underlying hierarchical structure of the Web graph; we argue that such a hierarchical structure also exists from the perspective of content. Future work therefore includes: 1) introducing the content hierarchy to improve general Web search; and 2) combining the two types of hierarchical structure, from links and from content respectively.

6. REFERENCES
[1] Google search engine. http://www.google.com.
[2] Yahoo! search engine. http://search.yahoo.com.
[3] R. Baeza-Yates, F. Saint-Jean, and C. Castillo. Web dynamics, age and page quality. In Proc. of SPIRE 2002, Lisbon, Portugal, September 2002.
[4] L. D. Baker and A. McCallum. Distributional clustering of words for text classification. In Proc. of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 96-103, 1998.
[5] D. Bergmark, C. Lagoze, and A. Sbityakov. Focused crawls, tunneling, and digital libraries. In Proc. of the 6th European Conference on Digital Libraries, pages 91-106, September 2002.
[6] P. Boldi, M. Santini, and S. Vigna. PageRank as a function of the damping factor. In Proc. of the 14th International Conference on World Wide Web, Chiba, Japan, May 2005.
[7] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In Proc. of the 7th International World Wide Web Conference, May 1998.
[8] I. S. Dhillon. Co-clustering documents and words using bipartite spectral graph partitioning. In Proc. of the 7th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 269-274, 2001.
[9] I. S. Dhillon, S. Mallela, and R. Kumar. Enhanced word clustering for hierarchical text classification. In Proc. of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002.
[10] M. Diligenti, M. Gori, and M. Maggini. Web page scoring systems for horizontal and vertical search. In Proc. of the 11th International World Wide Web Conference, May 2002.
[11] S. Dumais and H. Chen. Hierarchical classification of web content. In Proc. of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, August 2000.
[12] T. H. Haveliwala. Topic-sensitive PageRank. In Proc. of the 11th International World Wide Web Conference, 2002.
[13] A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
[14] G. Jeh and J. Widom. Scaling personalized web search. In Proc. of the 12th International World Wide Web Conference, 2003.
[15] S. D. Kamvar, T. H. Haveliwala, C. D. Manning, and G. H. Golub. Exploiting the block structure of the web for computing PageRank. Technical report, Stanford University, Stanford, CA, 2003.
[16] J. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604-622, 1999.
[17] D. Koller and M. Sahami.
Hierarchically classifying documents using very few words. In Proc. of the 14th International Conference on Machine Learning, pages 170-178, 1997.
[18] K. Lang. NewsWeeder: Learning to filter netnews. In Proc. of the 12th International Conference on Machine Learning, pages 331-339, 1995.
[19] T. Li, S. Zhu, and M. Ogihara. Topic hierarchy generation via linear discriminant projection. In Proc. of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Toronto, Canada, 2003.
[20] A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems, volume 14, pages 849-856, 2002.
[21] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford University, Stanford, CA, 1998.
[22] M. Richardson and P. Domingos. The intelligent surfer: Probabilistic combination of link and content information in PageRank. In Advances in Neural Information Processing Systems, volume 14, Cambridge, MA, 2002. MIT Press.
[23] S. E. Robertson. Overview of the Okapi projects. Journal of Documentation, 53(1), 1997.
[24] S. Vaithyanathan and B. Dom. Model-based hierarchical clustering. In Proc. of the 16th Conference on Uncertainty in Artificial Intelligence, pages 599-608, 2000.
[25] X. Wang, A. Shakery, and T. Tao. Dirichlet PageRank. In Proc. of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 661-662, Salvador, Brazil, 2005.
[26] W. Xi, J. Lind, and E. Brill. Learning effective ranking functions for newsgroup search. In Proc. of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 394-401, Sheffield, United Kingdom, 2004.
[27] G. R. Xue, Q. Yang, H. J. Zeng, Y. Yu, and Z. Chen. Exploiting the hierarchical structure for link analysis. In Proc. of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Salvador, Brazil, August 2005.
[28] Y. Yang and J. O. Pedersen. A comparative study on feature selection in text categorization. In Proc. of the 14th International Conference on Machine Learning, pages 412-420, 1997.

APPENDIX
A. COMPUTING FGRANK WITH SINK NODES
In this appendix we prove the result given in (9). Let s be the n-dimensional column vector identifying the nodes with out-links:

    s_i = 0  if deg(i) = 0,    s_i = 1  otherwise    (10)

where deg(i) is the number of out-links of page i. To simplify the notation, let \bar{s} = 1 - s. We construct \bar{P} from P as follows:

    \bar{P} = P + diag(\bar{s}) E    (11)

Because every row of E sums to 1, \bar{P} is the transition matrix of a graph without sink nodes. Replacing P with the modified graph \bar{P} and rewriting (1):

    v_{PR} = c \bar{P}^T v_{PR} + (1 - c) E^T v_{PR}
           = c P^T v_{PR} + c E^T diag(\bar{s}) v_{PR} + (1 - c) E^T v_{PR}
           = c P^T v_{PR} + E^T diag(1 - c s) v_{PR}    (12)

Following the derivation in (8), we obtain the conclusion in (9).
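To make Eqs. (8), (9), and (12) concrete, here is a small numerical sketch (not from the paper) of the FGRank iteration that uses only the k per-topic personalization vectors together with the sink-node correction of the appendix. The toy graph, topic labels, and bias vectors are illustrative assumptions.

```python
import numpy as np

def fgrank(P, topic_of, topic_bias, c=0.85, tol=1e-12, max_iter=500):
    """FGRank iteration of Eq. (9): rows of P for sink pages are all zero,
    and each page teleports with its topic's personalization vector."""
    n = P.shape[0]
    s = (P.sum(axis=1) > 0).astype(float)         # s_i = 0 for sink pages
    topics = sorted(topic_bias)                   # the k distinct topics
    masks = {t: np.array([topic_of[i] == t for i in range(n)]) for t in topics}
    v = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        v_new = c * P.T @ v
        for t in topics:                          # k rank-one corrections
            m = masks[t]
            v_new += topic_bias[t] * np.sum((1.0 - c * s[m]) * v[m])
        if np.abs(v_new - v).sum() < tol:
            break
        v = v_new
    return v

# Toy setup: 4 pages, page 3 is a sink; two topics with assumed bias vectors
# (each bias vector sums to 1, as produced by Eq. (3)).
P = np.array([[0.0, 1.0, 0.0, 0.0],
              [0.5, 0.0, 0.5, 0.0],
              [0.0, 0.0, 0.0, 1.0],
              [0.0, 0.0, 0.0, 0.0]])              # page 3 has no out-links
topic_of = {0: "A", 1: "A", 2: "B", 3: "B"}
topic_bias = {"A": np.array([0.4, 0.4, 0.1, 0.1]),
              "B": np.array([0.1, 0.1, 0.4, 0.4])}
v = fgrank(P, topic_of, topic_bias)
print(v, v.sum())                                 # ranks sum to ~1
```

Because each topic's bias vector sums to one, the sink-node term of Eq. (9) redistributes the rank of dangling pages through their topic's teleportation vector, so the total rank mass is preserved without ever materializing the full teleportation matrix E.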