Graph-based Text Classification: Learn from Your Neighbors

Ralitsa Angelova (angelova@mpi-inf.mpg.de) and Gerhard Weikum (weikum@mpi-inf.mpg.de)
Max Planck Institute for Informatics, Stuhlsatzenhausweg 85, 66123 Saarbrücken, Germany

ABSTRACT
Automatic classification of data items, based on training samples, can be boosted by considering the neighborhood of data items in a graph structure (e.g., neighboring documents in a hyperlink environment, or co-authors and their publications for bibliographic data entries). This paper presents a new method for graph-based classification, with particular emphasis on hyperlinked text documents but broader applicability. Our approach is based on iterative relaxation labeling and can be combined with either Bayesian or SVM classifiers on the feature spaces of the given data items. The graph neighborhood is taken into consideration to exploit locality patterns while at the same time avoiding overfitting. In contrast to prior work along these lines, our approach employs a number of novel techniques: dynamically inferring the link/class pattern in the graph during the iterative relaxation labeling, judicious pruning of edges from the neighborhood graph based on node dissimilarities and node degrees, weighting the influence of edges based on a distance metric between the classification labels of interest, and weighting edges by content similarity measures. Our techniques considerably improve the robustness and accuracy of the classification outcome, as shown in systematic experimental comparisons with previously published methods on three different real-world datasets.

Categories and Subject Descriptors: I.5.2 [Pattern Recognition]: Classifier design and evaluation; H.3.3 [Information Systems]: Information Search and Retrieval

General Terms: Theory, Algorithms, Reliability

Keywords: Exploiting Link Structure, Automatic Classification

1. INTRODUCTION
Automatic classification is a supervised learning technique for assigning thematic categories to data items such as customer records, gene-expression data records, Web pages, or text documents. It has numerous applications in fields like data mining, information retrieval, Web analysis, business analytics, and bioinformatics. The standard approach is to represent each data item by a feature vector, e.g., the frequencies of words or word stems in a Web page, and to learn the parameters of a mathematical decision model such as a Support Vector Machine, a Bayesian classifier, or a decision tree from intellectually labeled training data. When the trained classifier is later presented with test data with unknown category labels, the standard paradigm is to apply the decision model to each data item in a "context-free" manner: the decision is based only on the feature vector of the given data item, disregarding the other data items in the test set.

In many settings, this "context-free" approach fails to exploit the available information about relationships between data items. For example, if we are asked to classify a book into a genre, we can take advantage of additional information such as the genre of other books by the same author, or which other books were bought by the readers of this one. Similarly, the hyperlink neighbors of a Web page give us clues about the topic of the page. Using the relationship information, we can construct a graph G in which each data item (e.g., a Web page) is a node and each relationship instance (e.g., a hyperlink) forms an edge between the corresponding nodes. The classification problem can then be formulated as a graph labeling or coloring problem on this graph.
In the following we mostly focus on text documents with links to and from other documents; thus we will often use the document/link terminology, but the approach is applicable to a wide variety of graph types derived from structured, semistructured, or unstructured data items and their relationships. A straightforward approach to capturing a document's neighbors (i.e., the directly related nodes in the graph) would be to incorporate the features and feature weights of the neighbors into the feature vector of the document itself. This has indeed been attempted in the literature for the case of hyperlinked text documents. However, such feature engineering faces tremendous problems of parameter tuning, e.g., for adjusting the weights of neighbor features when they are added to the document's own features, and does not generally lead to a robust solution [3]. A more advanced approach is to model the mutual influence between neighboring documents, aiming to estimate the class labels of all test documents simultaneously. In principle, such a model could even consider long-range influences among transitively related documents, with decreasing influence as the distance in the graph increases. For tractability, however, it makes sense to focus on the strongest dependencies, namely those among immediate neighbors. Such a model is called a first-order Markov Random Field (MRF) [14, 17]. Computing the parameters of an MRF such that the likelihood of the observed training labels is maximized is a difficult problem that cannot be solved in closed analytic form; it is typically addressed by an iteration technique known as relaxation labeling (RL). Our approach builds on this mathematical technique.

[Figure 1: Relaxation labeling (RL) illustration. Panel (a) shows a partially labeled test graph with unlabeled nodes marked "?"; panel (b) shows the topic citation (contingency) matrix estimated from training data, with entries 0.62/0.38 and 0.75/0.25, and the marginal distribution of topics (0.5 Triangle, 0.5 Rectangle); panel (c) shows the initial content-based label assignment; panel (d) shows the corrected assignment after relaxation labeling.]

A simple example of RL is shown in Figure 1. Assume a partially supervised classification scenario, i.e., some of the documents in the test graph already have assigned class labels, as in Figure 1(a). Let our set of classes be C = {Triangle, Rectangle}. We wish to assign to every document marked "?" its most probable label. The label assignments have to be consistent with the data observed during training. For the purposes of our example, let the contingency matrix in Figure 1(b) be estimated from the training data; it contains the probabilities of class pairs occurring on neighboring nodes. As discussed, the RL technique aims to combine the available node information with the information that can be induced from the graph structure. It first uses a bootstrap mechanism to assign initial labels to all documents without one: a content-only classifier assigns the label Triangle or Rectangle to every test document marked "?", as shown in Figure 1(c). RL then iteratively corrects this initial assignment wherever neighboring documents have labels that are unlikely according to the matrix in Figure 1(b). Suppose that after step (c) test document d is assigned the label Rectangle. Since all neighbors of d are labeled Triangle, the RL procedure may decide to "correct" this label to Triangle and output the label assignments shown in Figure 1(d).
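To make the correction step concrete, the following minimal Python sketch replays a single hard relabeling decision in the spirit of Figure 1. The toy graph, the content-based prior, and the placement of the matrix entries are illustrative assumptions, not the paper's exact figure data.

```python
# Toy relaxation-labeling correction step (illustrative values only).

# Contingency matrix sigma[(c, c')]: estimated probability of observing the
# label pair (c, c') on the endpoints of an edge. The 0.62/0.38/0.75/0.25
# values echo the numbers visible in Figure 1(b); their placement in the
# matrix is our assumption.
sigma = {
    ("Triangle", "Triangle"): 0.62, ("Triangle", "Rectangle"): 0.38,
    ("Rectangle", "Triangle"): 0.25, ("Rectangle", "Rectangle"): 0.75,
}
classes = ("Triangle", "Rectangle")

# Node d got the (weak) initial label Rectangle from the content-only
# classifier; all three of its neighbors are labeled Triangle.
neighbor_labels = ["Triangle", "Triangle", "Triangle"]
pi_d = {"Triangle": 0.45, "Rectangle": 0.55}   # assumed content-based prior

def relabel(pi, nb_labels):
    # Rescore each candidate class: content prior times the compatibility
    # with every neighbor's current label, then keep the argmax.
    scores = {c: pi[c] for c in classes}
    for c in classes:
        for c_nb in nb_labels:
            scores[c] *= sigma[(c, c_nb)]
    return max(scores, key=scores.get)

print(relabel(pi_d, neighbor_labels))  # -> 'Triangle': neighbors override prior
```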
1.1 Related work
The theory paper by Kleinberg and Tardos [12] views the classification problem for nodes in an undirected graph as a metric labeling problem, in which we aim to optimize a combinatorial function consisting of assignment costs and separation costs. The assignment costs are based on the individual choice of label we make for each object, while the separation costs are based on the pair of label choices we make for two neighboring objects. The combination of the assignment and separation costs gives the total cost; a labeling that minimizes the total cost is called a maximally likely labeling of the test graph. [12] shows that the metric labeling problem is NP-hard and presents an approximation scheme for it, based on integer linear programming.

Instead of seeking a global optimization that obtains a unique labeling for all nodes in the test graph, Chakrabarti et al. [3, 1] propose to start with a greedy labeling of the graph, paying attention only to the node-labeling (assignment) cost, and then iteratively to "correct" the neighborhood labeling wherever the presence of edges leads to a very high penalty in terms of separation costs. In the context of link information, their relaxation labeling algorithm first uses a Naive Bayes text classifier to assign class probabilities to each node; it then considers each page in turn and reevaluates its class probabilities in light of the latest estimates for the class probabilities of its neighbors.

Oh et al. [16] present a "single step" approach in which the label of each node d in the graph is influenced by the popularity of this label among all immediate neighbors of d and by the level of confidence in the labels of the documents in the neighborhood. To this end, the term weight w_t of any document is adjusted using the term frequencies in the neighboring documents and a parameter that controls the degree of influence.

A regression model was proposed by Lu and Getoor [15], which combines the text features of every given test document with the label assignments of its neighbors and iteratively tries to improve the classification result. [15] distinguishes three ways of constructing a feature vector for each document d. In Binary mode, the feature vector includes the document's "text" features and their weights, plus additional new features derived from the underlying link structure, namely every class label c_i ∈ C that appears at least once in the neighborhood N(d). In Count mode, the feature vector is enhanced with the neighbors' labels c_i weighted by their frequencies in N(d). In Single mode, only the most popular class label is included in the feature vector of document d, with weight equal to its frequency in N(d). (A small sketch of these three modes follows below.)

In general, graph-aware mining has lately received much attention in machine learning, but the major focus has been on issues like authority ranking, information extraction, entity disambiguation, and relation learning (see, e.g., [10, 18, 6, 9, 7, 5, 13, 2]).
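To illustrate the three feature-construction modes of [15] just described, here is a minimal Python sketch. The feature-naming scheme and the toy neighborhood are our own assumptions, not details from [15].

```python
from collections import Counter

def link_features(neighbor_labels, mode):
    """Extra features derived from the class labels observed in N(d)."""
    counts = Counter(neighbor_labels)
    if mode == "binary":   # Binary: flag every label that occurs in N(d)
        return {"link:" + c: 1.0 for c in counts}
    if mode == "count":    # Count: weight each label by its frequency in N(d)
        return {"link:" + c: float(n) for c, n in counts.items()}
    if mode == "single":   # Single: only the most popular label, by frequency
        c, n = counts.most_common(1)[0]
        return {"link:" + c: float(n)}
    raise ValueError(mode)

text_features = {"query": 2.0, "index": 1.0}      # d's own toy text features
for mode in ("binary", "count", "single"):
    vec = {**text_features, **link_features(["DB", "DB", "ML"], mode)}
    print(mode, vec)
```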
1.2 Contribution
We propose a new algorithm that builds on, but extends and generalizes, the earlier work by Chakrabarti [3], which uses the theory of Markov Random Fields to derive a relaxation labeling technique for the graph labeling problem. The major novelties lie in making the approach more robust (i.e., less susceptible to particularities of the data or to hand-tuned parameters) and in achieving higher classification accuracy than the prior work. Salient points of our graph-based classifier (GC) are:

· We capture and exploit the link/class patterns in the complete graph dynamically as RL iterates. This is in contrast to the method of [3], where the separation cost is derived from the link/class pattern of the training data only, and only once at training time.

· For enhanced robustness, we select a "reliable" set of neighbors for each test document by means of a similarity threshold. We consider only those links for which a similarity measure between the feature vectors of the two neighbors is sufficiently high, thus eliminating the danger of overreacting to very noisy local environments.

· As additional information, disregarded in all prior work except the theoretical paper [12], we consider a metric distance between each pair of category labels (c, c') in the set of possible labels C. This can be exploited in situations where, for example, a node has many neighbors labeled "chemistry", which may support the hypothesis that the node is about "biology" (assuming the node's content, e.g., frequent occurrence of the word "life", initially suggests a high probability for the topic "biology"), whereas another node with many neighbors on "politics" is much less likely to have the true label "biology".

· We discriminate different levels of trust in different neighbors by assigning weights to links based on the content similarity of the connected nodes.

· In contrast to the prior work of [3, 16], which was restricted to a Naive Bayes method for the initial per-node classifier so as to stay within the probabilistic framework of MRFs, our approach can incorporate other, more powerful per-node classifiers such as SVMs.

We present a systematic performance evaluation of our GC algorithm, based on three different real-world datasets and with detailed comparisons to all known competitors [3, 15, 16]. Our method achieves significant gains in accuracy in some cases and performs as well as the best alternative in all other cases. All previously proposed methods, while showing good results in some cases, lack robustness: their performance deteriorates dramatically in other cases. Some of our novel techniques could also be integrated into the prior methods, but combining all of them into the full-fledged GC algorithm provides synergies and achieves the best performance.

2. GRAPH-BASED CLASSIFICATION

2.1 Theoretical framework
Our approach is based on a probabilistic formulation of the classification problem and uses a relaxation labeling technique to derive two major approaches for finding the maximally likely labeling of the given test graph: hard and soft labeling.

Let D be a set of documents, and let G be a graph whose vertices correspond to documents d ∈ D and whose edges represent the link structure of D. As input, the classifier has the text of each document d and information about which documents of G constitute its neighborhood N(d). Since we consider graphs whose nodes are being labeled, we refer to nodes and documents interchangeably, and likewise to edges and links. λ(u) denotes the label of node u: an initial, a final, or an intermediate label whose validity can be associated with a probability. The feature vector that locally captures the content of document d, derived from the set of terms that occur in d, is denoted by τ(d). The output of the classification should be an assignment of labels to the graph nodes such that each document d receives its maximally likely label c from a finite set of labels C.

Taking into account the underlying link structure and document d's content-based feature vector, the probability of a label c ∈ C being assigned to d is

  Pr[λ(d) = c | τ, G] = Pr[λ(d) = c | τ(d), λ(d_1), ..., λ(d_l)]

where d_1 through d_l are the documents in D. In the spirit of the introduction's discussion on emphasizing the influence of each document's immediate neighbors N(d), we obtain

  Pr[λ(d) = c | τ(d), G] = Pr[λ(d) = c | τ(d), N(d)]

and denote it by γ_{c,d}. This reflects the MRF assumption that the label of a node (as a random variable) is conditionally independent of the labels of all other nodes in the graph, given the labels of its immediate neighbors. We abbreviate Pr[λ(d) = c | τ(d)], the graph-unaware probability based only on d's local content, by π_{c,d}. The probability of a specific labeling λ(N(d)) being assigned to the test document neighborhood N(d) is denoted by Pr[λ(N(d))]. Applying, for tractability, the additional independence assumption that there is no direct coupling between the content of a document and the labels of its neighbors, the following central equation holds for the total (prior, i.e., unconditional) probability γ_{c,d}, summing the posterior (i.e., conditional) probabilities over all possible labelings of the neighborhood:

  γ_{c,d} = π_{c,d} · Σ_{λ(N(d))} Pr[λ(d) = c | λ(N(d))] · Pr[λ(N(d))]

If we further assume independence among all neighbor labels of the same node (while still capturing the dependence between a node and each of its neighbors), we reach the following formulation of our neighborhood-conscious classification problem, where Pr[λ(d) = c ∧ λ(d')] denotes the joint probability that d has label c and d' has the label assigned to it by the labeling λ(N(d)) being summed over:

  γ_{c,d} = π_{c,d} · Σ_{λ(N(d))} Π_{d' ∈ N(d)} Pr[λ(d) = c ∧ λ(d')]

This can be computed in an iterative RL manner as follows:

  γ^{(r)}_{c,d} = π_{c,d} · Σ_{λ(N(d))} Π_{d' ∈ N(d)} Pr[λ(d) = c ∧ λ(d')]^{(r-1)}

where r ≥ 1. With the shorthand notation σ_{c,c'} = Pr[λ(d) = c ∧ λ(d') = c'] we can rewrite this as:

  γ^{(r)}_{c,d} = π_{c,d} · Σ_{λ(N(d))} Π_{d' ∈ N(d)} σ^{(r-1)}_{c,λ(d')}     (1)

So far we have presented our basic framework. The remaining part, which poses the major difficulties and where our innovation lies, is the estimation of the parameters, both for the initial solution and during the iterations. In iteration r = 0, all test nodes receive their initial labels from the purely content-based classifier, e.g., a multinomial Naive Bayes classifier or an SVM classifier. All subsequent iterations are based on Equation (1). We iterate until the probabilities γ_{c,d} stabilize for each document and class, i.e., until the magnitude of change drops below some stop parameter ε. The relaxation is guaranteed to converge to a locally consistent assignment if initiated sufficiently close to a consistent labeling [14, 17]. As shown in [17], relaxation algorithms are local optimizers by nature (similarly to EM methods); they do not necessarily arrive at the global optimum.
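As one concrete reading of a single update under Equation (1), the following Python sketch enumerates every joint labeling of N(d), weighting each summand by the previous round's node label probabilities and the σ compatibilities, and then renormalizes. The dict-based data layout is our own choice, and how the labeling probability is factored against σ is our interpretation of the garble-free intent, not a verbatim transcription; the deliberate brute-force enumeration incurs exactly the m^{|N(d)|} cost that Sections 2.2 and 2.3 are designed to avoid.

```python
from itertools import product

def gamma_update(d, classes, neighbors, pi, prev_gamma, sigma):
    """One RL update for node d: gamma[c] is proportional to pi[c] times the
    sum over all neighborhood labelings of (probability of the labeling under
    the previous iteration) * (product of sigma compatibilities)."""
    nbs = neighbors[d]
    gamma = {}
    for c in classes:
        total = 0.0
        for labeling in product(classes, repeat=len(nbs)):  # m**|N(d)| terms
            p_labeling = 1.0   # Pr[labeling] from the previous iteration
            compat = 1.0       # product of sigma[(c, label of neighbor)]
            for n, c_n in zip(nbs, labeling):
                p_labeling *= prev_gamma[n][c_n]
                compat *= sigma[(c, c_n)]
            total += p_labeling * compat
        gamma[c] = pi[d][c] * total
    z = sum(gamma.values()) or 1.0
    return {c: v / z for c, v in gamma.items()}   # renormalized distribution
```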
Given a relaxation labeling algorithm and the data, two factors affect the solution: the initial label assignment and the cost function used to iteratively find the maximum of the labeling function. In our case, the iterative scheme contains both factors: the initial label assignment from the content-based classification in iteration r = 0, and a cost function involving a dynamically computed separation cost (i.e., the cost of labeling neighbors with different topics), re-adjusted in each iteration. The latter is absent in the prior work [3], which computed its notion of separation cost only once, upfront.

Calculating the sum over all possible labelings in Equation (1) is hard, as we have m^{|N(d)|} summands, where m is the number of distinct class labels. To solve this problem we employ the two major methods described in Subsections 2.2 and 2.3. Depending on the chosen method of classification, we approximate the sum over all possible labelings of the neighborhood either by its single most significant summand, treating it as if it were the true set of labels (hard labeling), or by the p most significant summands and their associated probabilities, where p is a tunable constant (soft labeling). The latter method was also employed in [3], which, however, computes these neighbor-labeling probabilities only once, before the first RL iteration, whereas our GC method efficiently re-computes the updated probabilities after each RL iteration. To this end, we propose a fast algorithm for computing the p most significant summands, presented in Section 2.2.1.

2.2 Soft Labeling
The soft labeling approach aims at better classification accuracy by avoiding the overly eager "rounding" performed by the hard labeling approach. Instead, we take into account the p most significant combinations of labelings of the test document neighborhood in Equation (1). This is motivated by the observation that, apart from the p most probable labelings of the neighborhood, the remaining combinations of class assignments have very low probabilities. Thus they do not contribute much to the calculation of γ^{(r)}_{c,d} (the probability of document d being labeled c) and can be ignored. This reduces the exponential number of summands in Equation (1) from m^{|N(d)|} to p and makes its computation feasible.

2.2.1 Finding the p most significant summands

[Figure 2: Log-likelihood graph. The nodes d_1, ..., d_n correspond to the neighbors of d; the parallel edges between consecutive nodes d_{i-1} and d_i carry weights -log of the respective label probabilities, one edge per candidate label of d_i.]

Reducing the number of possible neighborhood labelings corresponds to finding the p shortest 0 → n paths in the graph shown in Figure 2. For this purpose we propose an efficient algorithm that works as follows. In the graph, each neighbor of d corresponds to a node i (i = 1..n), and the parallel edges entering node i have weights given by the negative log of the corresponding label probabilities of neighbor d_i from the previous iteration, sorted in increasing weight (i.e., decreasing probability). We denote by P_i the i-th shortest path, and by P_{ij} the path obtained from P_i by replacing the edge between nodes v_j and v_{j+1} with the next-heavier edge. To avoid the problem that a path can have more than one parent, we enforce the condition that if path P_k is formed as P_{ij}, i.e., derived from path P_i by changing the j-th edge, then it is not allowed to create paths P_{kl} with l < j. We store the lengths of all paths P_{ij} (together with their corresponding i and j) that can be derived from each path P_i computed so far, and we create the next path by picking the smallest stored path length. The running time of this algorithm is O(nm log m + pn log p), where n is the number of nodes and m is the number of distinct labels. Its advantage is simplicity of implementation, at the price of a slightly worse running time than the optimum O(pn). The approach used in [3], proposed by Eppstein [8], is very complicated to implement and does not yield the explicit paths but only an encoding of the paths, from which the actual paths must be computed at extra cost.
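A compact heap-based rendering of this search is sketched below in Python. It enumerates label choices per neighbor in increasing total -log cost, using the single-parent condition described above; the variable names and bookkeeping are ours, so details may differ from the paper's implementation.

```python
import heapq

def p_best_labelings(weights, p):
    """weights[i]: the -log probability values for neighbor i's candidate
    labels, sorted increasingly. Returns up to p (cost, ranks) pairs in
    increasing total cost; ranks[i] is the rank of the label chosen for
    neighbor i (0 = most probable)."""
    n = len(weights)
    start_cost = sum(w[0] for w in weights)       # the all-minimum path
    heap = [(start_cost, (0,) * n, 0)]            # (cost, ranks, frontier j0)
    results = []
    while heap and len(results) < p:
        cost, ranks, j0 = heapq.heappop(heap)
        results.append((cost, ranks))
        # A child may only bump positions >= j0, so every path has exactly
        # one parent and no labeling is generated twice.
        for j in range(j0, n):
            r = ranks[j] + 1
            if r < len(weights[j]):
                child = ranks[:j] + (r,) + ranks[j + 1:]
                delta = weights[j][r] - weights[j][r - 1]
                heapq.heappush(heap, (cost + delta, child, j))
    return results

# Toy usage: 3 neighbors, 2 candidate labels each, p = 4 best labelings.
w = [[0.2, 1.1], [0.5, 0.6], [0.1, 2.0]]
for cost, ranks in p_best_labelings(w, 4):
    print(round(cost, 2), ranks)
```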
2.3 Hard Labeling
In contrast to the soft labeling approach just presented, which is the one equivalent to [3], we also consider a method that takes only the single most probable label assignment of the test document neighborhood to be significant for the computation of γ^{(r)}_{c,d} = Pr[λ(d) = c | τ, N(d)]^{(r)}. We call this newly employed approach hard labeling. It can be seen as a crude approximation of the sum in Equation (1), but depending on the sets C and D this approximation gives very good empirical results (see Section 4.3). Our expectation is that this method is more efficient and possibly more robust than the soft labeling method. Hard labeling can be thought of as "high-level" noise reduction: it trusts only the most probable neighborhood labeling and ignores labelings of lower probability, thus reducing the chance that an incorrectly labeled node in the neighborhood influences the label assignment of the current test node.

Let c'_max be the most probable label of a document d' ∈ N(d) as of iteration (r-1):

  c'_max = argmax_{c'} Pr[λ(d') = c']^{(r-1)}

Then, considering only the most probable labeling λ(N(d)) according to the node label probabilities of iteration (r-1), Equation (1) can be written as

  γ^{(r)}_{c,d} = π_{c,d} · Π_{d' ∈ N(d)} σ^{(r-1)}_{c,c'_max}

and in its simplified form, with the product running over the set of classes rather than over the documents in the neighborhood N(d):

  γ^{(r)}_{c,d} = π_{c,d} · Π_{c'_max ∈ C} (σ^{(r-1)}_{c,c'_max})^{n(c'_max)}

where n(c') is the frequency of label assignment c' in N(d): n(c') = |{d' ∈ N(d) | λ(d') = c'}|.
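The simplified class-frequency form above translates directly into code. A minimal sketch follows (our own function and variable names); pi_d, sigma, and the neighbor labels would come from the content classifier and the previous RL iteration.

```python
from collections import Counter

def hard_label_update(pi_d, neighbor_labels, sigma, classes):
    """gamma[c] is proportional to pi_d[c] * prod over classes c' of
    sigma[(c, c')] ** n(c'), where n(c') counts how often c' occurs among
    the neighbors' current most-probable labels."""
    n = Counter(neighbor_labels)
    gamma = {}
    for c in classes:
        score = pi_d[c]
        for c_max, freq in n.items():
            score *= sigma[(c, c_max)] ** freq  # one factor per class, not per edge
        gamma[c] = score
    z = sum(gamma.values()) or 1.0
    return {c: v / z for c, v in gamma.items()}

# Toy usage with two classes and four neighbors:
sigma = {("DB", "DB"): 0.7, ("DB", "ML"): 0.3,
         ("ML", "DB"): 0.3, ("ML", "ML"): 0.7}
print(hard_label_update({"DB": 0.4, "ML": 0.6}, ["DB", "DB", "DB", "ML"],
                        sigma, ["DB", "ML"]))
```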
2.4 Conditional probability of edge labeling
The probability σ_{c,c'} of a pair of labels c, c' being assigned to the endpoints of the edge between documents d and d' depends on the class-conditional probabilities of the corresponding nodes from the previous iteration. We have developed two major ways of calculating σ_{c,c'}:

· by a smoothed estimator based on the frequencies of edges tentatively labeled with (c, c') in round (i-1) of the iteration scheme (using Laplace smoothing);

· as the probability of edge labels derived from the test graph as of iteration (i-1), obtained by multiplying two probabilities (the probability of label c being assigned to document d, and the probability of label c' being assigned to document d' ∈ N(d)) and then averaging over all edges in the graph G.

We use the first method in combination with the hard node labeling approach and abbreviate the resulting classifier as GC[H]. The second is used with the soft node labeling method; the corresponding abbreviation is GC[S].

2.5 Edge pruning and weighting
Our method uses link weights based on the following rationale: the label of a document should be influenced more strongly by neighbors whose content is homogeneous and thematically closely related to the document's own content. As an intuitive example, suppose we want to determine whether a given Amazon product page sells cosmetics or routers. Each page contains highly valuable outgoing links, e.g., links to the product description on the manufacturer's page or to comparisons with similar products. But all too often the same product page also contains many non-relevant links, such as links to Amazon's generic "latest products" page, the user's shopping cart, the user's profile, etc. We aim to ignore the unnecessary and most probably noisy information behind these irrelevant links by assigning to each edge e a weight w_e equal to the cosine similarity between the feature vectors of the documents connected by the edge. This edge weighting scheme is applied for noise reduction in two ways: 1) we prune edges whose weight w_e is below a specified similarity threshold (S-threshold), and 2) we differentiate the remaining links, using higher w_e to promote links that are considered more important. These considerations lead to the following revised label probability for the RL algorithm:

  γ^{(i)}_{c,d} = π_{c,d} · Σ_{λ(N(d))} Π_{d' ∈ N(d)} σ^{(i-1)}_{λ(d),λ(d')} · w_e     (2)

where λ(d) = c is the label hypothesized for d.
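A small Python sketch of this weighting and pruning step follows, using cosine similarity over sparse term-weight dictionaries; the toy documents and function names are our own illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse term-weight vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def weight_and_prune(edges, vectors, s_threshold):
    """Keep edge (d, d') only if its cosine weight reaches the S-threshold;
    the surviving weight w_e later scales the sigma term in Equation (2)."""
    return {(d, d2): w for d, d2 in edges
            if (w := cosine(vectors[d], vectors[d2])) >= s_threshold}

docs = {"a": {"router": 2.0, "wifi": 1.0},
        "b": {"router": 1.0, "lan": 1.0},
        "cart": {"checkout": 3.0}}            # a noisy shopping-cart page
print(weight_and_prune([("a", "b"), ("a", "cart")], docs, s_threshold=0.25))
# Only ("a", "b") survives; the generic cart link is pruned as noise.
```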
2.6 Choice of basic content classifier
A multinomial Naive Bayes (NB) classifier is a natural choice as the content-based classifier in the MRF framework [3]. However, there are more powerful text classifiers; most notably, Support Vector Machines (SVMs) are among the best performing ones. SVM is a discriminative classifier that predicts only the class label, not a label probability distribution. Methods for translating the output of an SVM into probability estimates are discussed and compared in [19]. In our implementation we use the libsvm package [4], which supports probability estimates for SVM classifiers based on the methods of [19].

3. INCORPORATING METRIC LABEL DISTANCES
Intuitively, neighboring documents should receive similar class labels. For example, suppose we have a set of classes C = {Culture (C), Entertainment (E), Science (S)} and we wish to find the most probable label for a test document d. Typically, documents related to culture or entertainment have overlapping areas of discussion, like concerts, exhibitions, etc. This means the labels C and E are close to each other. On the other hand, a document discussing scientific problems (S) would be much farther away from both C and E. So a similarity metric μ(·,·) imposed on the set of labels C would have a high value for the pair (C, E) and small values for the class pairs (C, S) and (E, S). Returning to our example, suppose that document d has four neighbors and that these are labeled λ(N(d)) = {C, C, E, E}. Then the probability that document d is assigned class C should be higher than the same probability if the neighbors of d were labeled λ(N(d)) = {C, C, S, S}. Note that both neighbor labelings contain the same number of C labels; the key lies in the different similarities between C and the other topic in each of the two labelings. This is why introducing a metric μ(·,·) should help improve the classification result. Under such a metric, similar classes are separated by a shorter distance and impose a smaller separation cost on an edge labeling.

The theory paper by Kleinberg and Tardos [12] suggests constructing an r-HST tree approximation of the label distances, which is then plugged into their formulation of the global optimization problem. The argument for this specific approximation is the tractability of the otherwise NP-hard optimization problem, but the approach loses generality: in a classification scenario, the r-HST tree cannot capture the similarities between classes in a realistic manner (while also preserving the triangle inequality).

Our approach, in contrast, is general, and we construct the metric automatically from the training data. The metric value for a pair of labels (c_i, c_j) is computed from the content similarity between the corresponding sets of documents that were "hand-tagged" as c_i or c_j at training time. In our experiments, we use the term-vector-based cosine similarity between the "super-documents" that concatenate all training documents of the same class. This metric is completely precomputed and does not change as the relaxation labeling proceeds. It would also be easy to include a human supervision step, allowing an expert to adjust the metric values based on additional domain knowledge.

We incorporate the label metric into the iterations for computing the probability of an edge labeling σ_{c_i,c_j} by treating μ(c_i, c_j) as a scaling factor. In this way we magnify the impact of edges between nodes with similar labels and scale down the impact of edges between dissimilar ones:

  γ^{(i)}_{c,d} = π_{c,d} · Σ_{λ(N(d))} Π_{d' ∈ N(d)} σ^{(i-1)}_{λ(d),λ(d')} · w_e · μ(λ(d), λ(d'))     (3)

The product of the σ and μ terms on the right-hand side can be viewed as the probability that d and d' are thematically related: they carry their specific labels with probability σ, and those labels are thematically close with probability μ. Our experimental results (Section 4.6) show that incorporating μ into the relaxation labeling contributes significantly to more accurate classification, especially when the training dataset is very small. In such cases the algorithm exploits the additional "guidance" that the label metric provides for estimating the edge label probabilities in each iteration.
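A sketch of this automatic metric construction in Python is given below. Whitespace tokenization and the dictionary-based term vectors are simplifications; the paper's experiments use stemmed, stopword-filtered term vectors.

```python
from collections import Counter
from itertools import combinations
import math

def label_metric(training_docs):
    """training_docs: class label -> list of training document strings.
    Returns mu[(c_i, c_j)]: cosine similarity between the per-class
    'super-documents' obtained by concatenating each class's documents."""
    supers = {c: Counter(" ".join(docs).lower().split())
              for c, docs in training_docs.items()}   # assumes non-empty classes
    def cos(u, v):
        dot = sum(w * v.get(t, 0) for t, w in u.items())
        return dot / (math.sqrt(sum(w * w for w in u.values()))
                      * math.sqrt(sum(w * w for w in v.values())))
    mu = {(c, c): 1.0 for c in supers}
    for ci, cj in combinations(supers, 2):
        mu[(ci, cj)] = mu[(cj, ci)] = cos(supers[ci], supers[cj])
    return mu

# Toy usage:
mu = label_metric({"Chemistry": ["isomer organic reaction"],
                   "Biology":   ["organic life cell"],
                   "Politics":  ["election parliament"]})
print(round(mu[("Chemistry", "Biology")], 3),
      round(mu[("Chemistry", "Politics")], 3))
```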
All "singleton" nodes without edges were disregarded in the experiments as all graph-aware methods would behave identically to contentbased classifier on these nodes. The third dataset used in the experiments was the online encyclopedia Wikipedia. We crawled about 5000 documents from the released Wikipedia dump file. As links between the pages we used their natural hyperlink connectivity. We restricted the crawler to follow only links within Wikipedia. The feature space had about 70 000 unique terms; no feature selection was performed. The distinguished classes are "Politics", "Computer Science", "Physics", "Chemistry", "Biology", "Mathematics", and "Geography". We started the crawl from the main pages for each of these sub jects and used topic-specific words in the anchor text as indicators for whether an outgoing link should be followed. For example, we started gathering documents from the page http://en.wikipedia.org/wiki/Chemistry by following links that contained manually selected word stems like "chemist", "isomer", "branch", "organic", etc. All pages gathered from the starting chemistry page were considered to be "handtagged" as Chemistry. When the same page was discovered following paths with starting points of two different seeds, the page was discarded since no decision about its "true" label without human assessment could be taken. All datasets for our experiments are available at the URL http://www.mpi-inf.mpg.de/angelova. 4.2 Experimental Setup Each of the tested classifiers was given a fraction of documents as training data to estimate its parameters. Our GC classifier used as the input parameters k, M , and the choice of method for calculating label probabilities. Parameters k and M are used in the training graph construction as follows. The classifier first picks up k random documents, seeds, from the whole set D. Once all the k documents di are selected, a subgraph induced by each document di and its neighbors up to radius M is formed, and the training set Dtrn is enhanced with all newly discovered nodes. All gathered nodes, the k seeds and their neighbors up to depth M , 4. EXPERIMENTS 4.1 Datasets We have tested our graph-based classifier on three different data sets. The first one includes approximately 16000 scientific publications chosen from the DBLP database. The 489 form the set of vertices Vtrn . The links connecting these vertices form the set Etrn . Finally, we define a training graph as Gtrn = {Vtrn , Etrn }. The training graph is used by the algorithm to learn parameters like prior probabilities of classes and term distributions in the data. After the training phase, the algorithm classifies the given test graph which is basically constructed by excluding the training graph from the available data. This construction procedure led to fairly balanced class-probability distributions on some of the datasets and to fairly skewed distributions on others. For example, on the IMDB set the class-probability distribution for the corresponding classes in the graph is: Action - 0.19, Comedy - 0.26, Documentary - 0.18, and Drama - 0.37. On the other hand, in the Wikipedia data set classes Chemistry (0.05) and Physics (0.09) are much less frequent than all other classes (e.g., class Politics - 0.22, Mathematics - 0.18). For the experiments we repeat the construction of the test and training graph with different seeds of the random number generator while keeping all initial parameters unchanged. 
Then, the classifier is trained on the corresponding training graph and is applied to the test graph. Average values from 30 runs of each algorithm are reported as the result of an experiment. We present the results of the graph-based classification (GC) and its variants and compare them to a content-only classification and against other neighborhood aware methods previously proposed in the literature. We present the performance of all classifiers in terms of the standard measures accuracy (A, i.e. 1-error) and F1 (i.e., the harmonic mean of recall and precision). Since the micro averaged F1 measure is usually considered most insightful in the literature [1, 11], when aggregating results over multiple classes, we always present micro-averaged results. The macro- averaged results merely confirmed our findings and are omitted for space reasons. The different algorithms under test are abbreviated as follows: · GC [H ] or GC [S ] as discussed in Section 2.4. · mGC - metric GC that uses the label distance metric in the computations as discussed in Section 3. · wGC - GC using the edge weighting scheme as in Equation 2. · wmGC - GC that takes into account the label metric (·, ·) and the edge weights we in the test graph as in Equation 3. To indicate the content-only classifier that we used as a seed for the initial iteration, Naive-Bayesian Multinomial classifier (NB) or Support Vector Machine (SVM), we write GCN and GCS , respectively. For SVM we used the libsvm package [4]; all other methods were our own implementations in Java using JDK 1.5. Unless stated otherwise, we fix parameters as shown in Table 1. Table 1: Default parameters for all three data sets. Data set Initial numb er of seeds, k Radius of expansion, M Training data size (avg.) Test data size (avg.) Avg. degree of a training no de Avg. degree of a test no de Avg. degree after S -threshold S -threshold Value of p DBLP 700 1 7326.17 12739.83 34.17 8.54 8.54 0.0 5 IMDB 500 0 500.0 4154.0 0.09 92.73 0.71 0.25 5 Wikip edia 750 0 750.0 5355.1 0.81 15.66 5.73 0.2 5 4.3 Baseline experiment We first present the performance of GC versus contentonly classifiers on the DBLP data. In this baseline experiment GC used no label similarity metric, edge weighting or pruning. For building the training graph we used radius M = 1 and different sizes of the initial set of do cuments k as seeds. The results are shown in Table 2. Due to the features space exploited for each data set as well as the nature of both content-based classifiers, NB classifier outperforms SVM on the IMDB and DBLP data sets, while SVM shows better performance than NB on the Wikipedia set. Because of space limitations, we present in detail the results derived by the better performing content-based classifier for the different data sets and briefly explain how the other one behaves on the corresponding set. As we see from Table 2, the improvement of the graphbased methods versus the content-only methods is significant, and this holds for both NB and SVM as underlying classifiers. In most cases, the GC method wins 2 to 4 percent in the F1 measure over the content-only classifiers. These are really significant gains given the already very good performance of the underlying content-only classifiers. Only for very small amounts of training data, NB and SVM slightly outperform the GC methods. We will see in Subsection 4.5 that this susceptibility to particularities of very small training sets can be overcome by using the label similarity metric. 
The performance on the IMDB dataset is shown in Table 3, and the results for the Wikipedia dataset in Table 4. The IMDB experiment faced the particular challenge of very noisy links: on average, a movie node was connected to almost 100 other nodes. However, our S-thresholding technique (set to 0.25 in the IMDB case) was very successful in pruning out the noise; with edge pruning, GC outperformed the content-only classifiers by a big margin. We provide a sensitivity analysis of the S-threshold in Subsection 4.5. The improvement on the IMDB corpus confirms once more the hypothesis that even in a scarcely connected graph, neighborhood-aware classification achieves a good improvement over pure text-based classification.

Table 3: Performance for the IMDB data set.
  config. [M=0] | NB F1 | NB A  | GC_N[H] F1 | GC_N[H] A | GC_N[S] F1 | GC_N[S] A
  k = 500       | 58.08 | 69.13 | 62.77      | 74.58     | 62.76      | 74.57
  k = 700       | 61.10 | 71.13 | 66.75      | 77.65     | 66.76      | 77.53
  k = 900       | 64.17 | 73.43 | 70.48      | 80.71     | 70.27      | 80.57
  k = 1100      | 66.93 | 75.60 | 73.43      | 83.24     | 72.97      | 83.13

The Wikipedia experiment reconfirms the good performance of GC. Here the variant with SVM as the basic content-only classifier performs best. This is easily explained by the large and rich feature set of this data: SVM is well known to excel on large feature spaces.

Table 4: Performance for the Wikipedia data set.
  config. [M=0] | SVM F1 | SVM A | GC_S[H] F1 | GC_S[H] A | GC_S[S] F1 | GC_S[S] A
  k = 500       | 82.87  | 94.80 | 87.41      | 96.28     | 85.09      | 96.08
  k = 750       | 86.09  | 95.94 | 89.35      | 97.03     | 89.11      | 96.92
  k = 1000      | 88.20  | 96.74 | 90.22      | 97.34     | 89.91      | 97.19
  k = 1250      | 89.32  | 96.99 | 90.93      | 97.53     | 90.65      | 97.42

In summary, the graph-based approach gains up to 5 percent in micro-averaged F1. The gains are most prominent for small training sets, which we consider the most important scenario: training involves significant human labor and is time-consuming and expensive. This is why graph-based classification is an attractive method, with very good performance in situations where extensive training efforts are prohibitive.

4.4 Comparison to other neighborhood-aware algorithms
We compared the performance of our GC algorithm against the existing algorithms discussed in Section 1.1. The algorithm suggested by Chakrabarti et al. [3] is abbreviated as ChAlg, the one by Oh et al. [16] as OhAlg, and the one by Lu and Getoor [15] as LGAlg. The reader should note that OhAlg also takes advantage of the noise pruning strategy based on similarity thresholding. Our implementation of LGAlg uses SVM as the basic classifier, while the original paper used logistic regression; SVM is certainly competitive with logistic regression. Table 5 shows the results of all algorithms under the default settings for all three datasets. Note that the micro-averaged F1 is a much more meaningful measure of a classifier's performance than its accuracy.

Table 5: Comparison of known algorithms vs. GC (β denotes the influence parameter of OhAlg).
                      | DBLP F1 | DBLP A | IMDB F1 | IMDB A | Wikipedia F1 | Wikipedia A
  NB                  | 88.86   | 91.56  | 58.07   | 69.13  | 81.73        | 93.82
  SVM                 | 88.82   | 91.23  | 57.73   | 69.78  | 86.09        | 95.94
  GC_N[H]             | 92.00   | 93.55  | 62.77   | 74.58  | 82.20        | 93.93
  GC_S[H]             | 91.77   | 93.13  | 61.08   | 77.89  | 89.35        | 97.03
  wmGC_N[H]           | 92.44   | 93.96  | 62.02   | 74.35  | 82.45        | 93.98
  wmGC_S[H]           | 92.17   | 93.53  | 61.04   | 77.26  | 89.48        | 97.03
  ChAlg_N             | 91.04   | 92.71  | 17.69   | 41.86  | 82.08        | 93.78
  ChAlg_S             | 90.62   | 92.03  | 15.84   | 41.10  | 69.34        | 87.01
  OhAlg_N [β = 0.0]   | 89.48   | 91.68  | 56.52   | 72.81  | 85.56        | 94.99
  OhAlg_S [β = 0.0]   | 90.12   | 91.81  | 56.36   | 73.47  | 86.78        | 95.62
  OhAlg_N [β = 0.5]   | 91.35   | 93.16  | 58.78   | 76.30  | 88.53        | 96.15
  OhAlg_S [β = 0.5]   | 90.25   | 91.80  | 59.53   | 77.66  | 84.65        | 95.00
  OhAlg_N [β = 1.0]   | 91.71   | 91.28  | 59.53   | 77.66  | 88.75        | 96.31
  OhAlg_S [β = 1.0]   | 89.72   | 93.15  | 59.18   | 79.59  | 79.34        | 94.02
  LGAlg[Binary]       | 81.54   | 85.42  | 68.22   | 83.96  | 86.79        | 96.36
  LGAlg[Count]        | 81.23   | 86.75  | 65.60   | 82.33  | 86.80        | 96.40
  LGAlg[Single]       | 88.14   | 90.97  | 35.68   | 74.89  | 88.81        | 96.86
ChAlg does not work well with IMDB and Wikipedia because of the richness but also noise of their link structures. ChAlg lacks the additional capabilities of similarity thresholding/weighting and label similarity metrics that GC can exploit. Also, ChAlg does not re-adjust its edge label probability estimator during the RL iterations and is thus very critically dependent on sufficient training information on link/class patterns. The performance of LGAlg deteriorates dramatically on DBLP. OhAlg shows poor performance on IMDB because its rationale of incorporating weighted features from neighbors overreacts to the noise in the training data. In contrast, GC does not critically depend on the link structure in the training data and thus performs well also with small training sets. One of its key strengths is that it computes the edge label probabilities dynamically, in each iteration of the RL framework. Thus, it benefits from the "up-to-date" information that has so far been learned on the test data, not just the initialization from training. Another novel idea, that was incorporated in the GC algorithm, takes advantage of a metric label similarity as a corrective measure over the predictions of the text-based classifiers when the training data size is small (wmGC in Table 5). The role of the label metric is discussed in detail in Section 4.6 and the experiments show that it is significant. 4.5 Influence of the similarity thresholding This section discusses the influence of similarity thresholding and its effectiveness for noise reduction. Table 6: Similarity threshold influence for IMDB. S -threshold value 0.00 0.05 0.15 0.25 0.35 Test No des b elow data size S - threshold 4154.0 4154.0 4154.0 4154.0 4154.0 147.9 675.5 2750.7 3394.3 3718.5 GCN [H ] F1 37.44 41.59 52.80 60.62 67.98 A 52.34 56.07 71.08 77.60 82.12 GCS [H ] F1 37.44 41.59 58.64 62.77 66.76 A 52.34 56.07 70.43 74.58 78.36 IMDB F1 A Wikip edia F1 A 88.86 88.82 92.00 91.77 92.44 92.17 91.04 90.62 89.48 90.12 91.35 90.25 91.71 89.72 81.54 81.23 88.14 91.56 91.23 93.55 93.13 93.96 93.53 92.71 92.03 91.68 91.81 93.16 91.80 91.28 93.15 85.42 86.75 90.97 58.07 57.73 62.77 61.08 62.02 61.04 17.69 15.84 56.52 56.36 58.78 59.53 59.53 59.18 68.22 65.60 35.68 69.13 69.78 74.58 77.89 74.35 77.26 41.86 41.10 72.81 73.47 76.30 77.66 77.66 79.59 83.96 82.33 74.89 81.73 86.09 82.20 89.35 82.45 89.48 82.08 69.34 85.56 86.78 88.53 84.65 88.75 79.34 86.79 86.80 88.81 93.82 95.94 93.93 97.03 93.98 97.03 93.78 87.01 94.99 95.62 96.15 95.00 96.31 94.02 96.36 96.40 96.86 GC clearly outperforms all algorithms discussed in Section 1.1. We performed Student's t-test on this data and found that our results are significant for a test level of 1 - = 0.95 or higher. Following the arguments mentioned in Section 4.3, on the Wikipedia data set, GC performs significantly better than all other competitors, when using SVM as a pure content based classifier. OhAlg performs equally well on DBLP only in the case of = 1.0 in combination with Naive Bayesian as a text-only classifier. All other improvements shown on the DBLP data are significant according to the t-test. IMDB was a challenge for all graph-based classifiers. It is very noisy and often the neighborhood information could be misleading. On this set we perform significantly better than ChAlg, OhAlg with all three variations of , and LGAlg[Single], but LG[Binary] and LG[Count] show an even better performance here. 
Noise in the link structure was an issue only for IMDB and Wikipedia; DBLP, by contrast, is a well-structured database with a clear, meaningful, and almost noise-free link structure. For IMDB, we used common starring actors to form the link structure. Unfortunately, actors, especially at the beginning of their careers, play in movies that often belong to extremely diverse genres. Table 6 shows the GC performance for different S-threshold values. It becomes clear that a similarity threshold of at least 0.25 is necessary to form a meaningful neighborhood around each test document from the IMDB database.

The Wikipedia dataset contains many cross-links. For example, a page discussing the contributions of the famous chemist Friedrich Woehler contains links to the years or places of his discoveries. Following such links, we end up on pages that are completely irrelevant to the topic of Chemistry, e.g., that on the 1st of March "Antonio García Gutiérrez's play El Trovador played for the first time". Since we are interested in neighbors that enhance rather than degrade the quality of the classification, similarity thresholding is crucial for removing such links (or reducing their weight). Table 7 shows the results of varying the S-threshold.

Table 7: Similarity threshold influence for Wikipedia.
  S-threshold | Test data size | Nodes below S-threshold | GC_N[H] F1 | GC_N[H] A | GC_S[H] F1 | GC_S[H] A
  0.0         | 5352.83        | 0                       | 81.81      | 93.75     | 62.92      | 77.50
  0.1         | 5353.6         | 105.4                   | 82.16      | 93.89     | 87.27      | 96.15
  0.2         | 5355.1         | 526.17                  | 82.20      | 93.93     | 89.35      | 97.03
  0.3         | 5356.7         | 1272.5                  | 82.09      | 93.91     | 88.42      | 96.85

In general, finding the optimal similarity threshold is a data-dependent tuning issue. But even with a suboptimal yet reasonable choice, GC performs much better than the other graph-conscious methods, which do not consider neighbor similarities and are thus very susceptible to neighborhood noise. Automatic tuning of the S-threshold is an interesting topic for future work.
4.6 Influence of the label similarity metric
We now show how the micro-averaged F1 score is influenced by the similarity metric μ(·,·) induced over the set of possible labels. The metric for the Wikipedia data is shown in Table 8. For DBLP, the similarity value was 0.5131 between the labels "DB" and "Theory", 0.5114 between "DB" and "ML", and 0.5748 between "Theory" and "ML".

Table 8: Label similarity metric for Wikipedia.
  μ(·,·)   | C.Sc. | Biology | Geogr. | Chem. | Math. | Physics | Politics
  C.Sc.    | 1     | .3362   | .2058  | .3065 | .4265 | .3945   | .3042
  Biology  | .3362 | 1       | .2068  | .311  | .2444 | .2922   | .2558
  Geogr.   | .2058 | .2068   | 1      | .2039 | .1657 | .22     | .3531
  Chem.    | .3065 | .311    | .2039  | 1     | .2754 | .4335   | .2056
  Math.    | .4265 | .2444   | .1657  | .2754 | 1     | .4792   | .2142
  Physics  | .3945 | .2922   | .22    | .4335 | .4792 | 1       | .2485
  Politics | .3042 | .2558   | .3531  | .2056 | .2142 | .2485   | 1

Tables 9 and 10 show results for DBLP and Wikipedia with and without the label similarity metric, each combined with and without edge weighting; the IMDB results are omitted for lack of space. For all datasets, incorporating μ into the computation yields a significant improvement in GC performance. The impact is most prominent when the training data size is small and decreases gradually as the size of the training data increases. The predictive power of GC is influenced by two factors: the content-based initialization and the iterative scheme for finding the best possible labeling of the given test graph. Typically, when the training data size is small, the content-based classifiers are not very confident in their probabilistic estimates, which leaves room for the metric μ(·,·) to guide GC by gently imposing constraints on the separation cost estimates.

We observed the same degree of influence for both SVM and NB as the initial content-based classifier, so we show results only for SVM. Table 9 presents the results for the DBLP dataset. There is a significant positive influence of the label distance metric; its role is especially prominent in the results of the GC[S] variant, which is more susceptible to a poor initialization by the content-based classifier. The influence of edge weighting is less significant but still helps to improve GC performance.

Table 9: Label similarity influence for DBLP (micro-averaged F1).
         |        |           GC_S[H]             |           GC_S[S]
  k      | SVM    | GC    | mGC   | wGC   | wmGC  | GC    | mGC   | wGC   | wmGC
  300    | 86.35  | 88.14 | 89.74 | 89.00 | 90.22 | 80.43 | 87.84 | 84.74 | 88.87
  500    | 87.92  | 91.01 | 91.63 | 91.32 | 91.83 | 88.27 | 90.79 | 89.62 | 91.23
  700    | 88.90  | 91.73 | 92.07 | 91.94 | 92.19 | 90.36 | 92.02 | 91.70 | 92.21
  1100   | 89.93  | 92.70 | 92.71 | 92.74 | 92.79 | 92.17 | 92.59 | 92.57 | 92.73

The results for the Wikipedia dataset are shown in Table 10. We note the same tendency: the label distance metric μ(·,·) is a very powerful "guide", especially when the training data size is small.

Table 10: Label similarity influence for Wikipedia (micro-averaged F1).
         |        |           GC_S[H]             |           GC_S[S]
  k      | SVM    | GC    | mGC   | wGC   | wmGC  | GC    | mGC   | wGC   | wmGC
  300    | 77.52  | 81.77 | 83.11 | 82.38 | 83.73 | 78.03 | 81.57 | 78.40 | 82.08
  500    | 82.87  | 87.41 | 87.60 | 87.47 | 87.67 | 85.09 | 87.15 | 85.20 | 87.50
  750    | 86.09  | 89.35 | 89.48 | 89.47 | 89.48 | 89.11 | 89.32 | 88.87 | 89.53
  1250   | 89.32  | 90.93 | 90.94 | 90.97 | 90.97 | 90.64 | 90.75 | 90.80 | 90.84
5. CONCLUSION
The presented GC method for graph-based classification is a way of exploiting the context relationships of data items. Models of this kind are not new, but our method introduces a variety of novel techniques that lead to a much improved level of robustness and classifier accuracy: similarity-based neighbor pruning and edge weighting to reduce neighborhood noise, class label similarity for further weight adjustment, and the ability of GC to dynamically adjust the edge label probability estimates during the RL iterations. These are salient and unique properties of the GC algorithm. Our extensive experiments provide strong evidence for the superiority of the GC method. Some of our techniques could be integrated into prior methods, too, and would improve them; however, our experiments showed that the full combination of techniques constituting our GC algorithm performs best.

Our aim has been to develop a particularly robust method, and we believe that GC has achieved this goal to a large extent. In situations where the neighborhood information is rich and strongly indicative (e.g., in the Wikipedia test case), GC exploits it to boost classification accuracy. In other situations, where the neighborhood information is too noisy to be useful (e.g., in the IMDB test case), GC has built-in throttling capabilities that automatically regulate the degree to which it considers neighborhood information.

Incorporating metric distances among labels contributed substantially to the very good performance of the GC method. This is one new form of exploiting knowledge about the relationships among category labels, and thus about the structure of the classifier's target space. We plan to pursue this line of research further by investigating additional forms of such background knowledge.

6. REFERENCES
[1] S. Chakrabarti. Mining the Web: Discovering Knowledge from Hypertext Data. Morgan Kaufmann, 2002.
[2] S. Chakrabarti. Breaking through the syntax barrier: Searching with entities and relations. In ECML, 9-16, 2004.
[3] S. Chakrabarti, B. E. Dom, and P. Indyk. Enhanced hypertext categorization using hyperlinks. In SIGMOD, 307-318, 1998.
[4] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines. 2001.
[5] W. W. Cohen and S. Sarawagi. Exploiting dictionaries in named entity extraction: combining semi-Markov extraction processes and data integration methods. In KDD, 89-98, 2004.
[6] D. Cohn and T. Hofmann. The missing link - a probabilistic model of document content and hypertext connectivity. In NIPS 13, 2001.
[7] P. Domingos and M. Richardson. Mining the network value of customers. In KDD, 57-66, 2001.
[8] D. Eppstein. Finding the k shortest paths. In IEEE Symp. on Foundations of Computer Science, 154-165, 1994.
[9] R. Feldman and H. Shatkay. Link analysis for bioinformatics: Current state of the art. PSB Tutorial, 2003.
[10] L. Getoor. Link mining: a new data mining challenge. SIGKDD Explor. Newsl., 5(1):84-89, 2003.
[11] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2000.
[12] J. Kleinberg and E. Tardos. Approximation algorithms for classification problems with pairwise relationships: Metric labeling and Markov random fields. In FOCS, 1999.
[13] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, 282-289, 2001.
[14] S. Z. Li. Markov Random Field Modeling in Image Analysis. Springer-Verlag, 2001.
[15] Q. Lu and L. Getoor. Link-based classification. In ICML, 2003.
[16] H.-J. Oh, S. H. Myaeng, and M.-H. Lee. A practical hypertext categorization method using links and incrementally available class information. In SIGIR, 264-271, 2000.
[17] L. Pelkowitz. A continuous relaxation labeling algorithm for Markov random fields. IEEE Transactions on Systems, Man, and Cybernetics, 20:709-715, 1990.
[18] T. Washio and H. Motoda. State of the art of graph-based data mining. SIGKDD Explor. Newsl., 5(1):59-68, 2003.
[19] T.-F. Wu, C.-J. Lin, and R. C. Weng. Probability estimates for multi-class classification by pairwise coupling. Journal of Machine Learning Research, 5:975-1005, 2004.