Random Spanning Trees and the Prediction of Weighted Graphs

Nicolò Cesa-Bianchi, DSI, Università di Milano, Italy (cesa-bianchi@dsi.unimi.it)
Claudio Gentile, DICOM, Università dell'Insubria, Italy (claudio.gentile@uninsubria.it)
Fabio Vitale, DSI, Università di Milano, Italy (fabio.vitale@unimi.it)
Giovanni Zappella, DSI, Università di Milano, Italy (giovanni.zappella@studenti.unimi.it)

Abstract

We show that the mistake bound for predicting the nodes of an arbitrary weighted graph is characterized (up to logarithmic factors) by the cutsize of a random spanning tree of the graph. The cutsize is induced by the unknown adversarial labeling of the graph nodes. In deriving our characterization, we obtain a simple randomized algorithm achieving the optimal mistake bound on any weighted graph. Our algorithm draws a random spanning tree of the original graph and then predicts the nodes of this tree in constant amortized time and linear space. Experiments on real-world datasets show that our method compares well to both global (Perceptron) and local (label-propagation) methods, while being much faster.

1. Introduction

A widespread approach to the solution of classification problems is representing the data through a weighted graph in which edge weights quantify the similarity between data points. This technique for coding input data has been applied to several domains, including Web spam detection (Herbster et al., 2009b), classification of genomic data (Tsuda & Schölkopf, 2009), face recognition (Chang & Yeung, 2006), and text categorization (Goldberg & Zhu, 2004). In most applications, edge weights are computed through a complex data-modelling process and convey crucially important information for classifying nodes.

This paper focuses on the online version of the graph classification problem: the entire graph is known in advance and, at each step, the algorithm is required to predict the label of a new, arbitrarily chosen node. In the special case of unweighted graphs (where all edges have unit weight), a key parameter for controlling the number of prediction mistakes is the size of the cut induced by the unknown adversarial labeling of the nodes. Although in the unweighted case previous studies use the cutsize to prove several interesting upper bounds (Herbster et al., 2009a; Herbster & Pontil, 2007; Herbster et al., 2009b), no general lower bounds on the number of prediction mistakes are known, leaving fully open the question of characterizing the complexity of learning a labeled graph. In a recent paper, Cesa-Bianchi et al. (2009) bound the expected number of mistakes by the cutsize of a random spanning tree of the graph, a quantity typically much smaller than the cutsize of the whole graph. In this paper we show that this quantity captures the hardness of the graph learning problem. Given any weighted graph, we prove that any prediction algorithm must err on a number of nodes which is at least as big as the cutsize of the graph's random spanning tree (which is defined in terms of the graph's weights). Moreover, we exhibit a simple algorithm achieving the optimal mistake bound to within logarithmic factors on any sufficiently connected weighted graph whose weighted cutsize is not an overwhelming fraction of the total weight. Following Cesa-Bianchi et al. (2009), our algorithm first extracts a random spanning tree of the original graph. Then, it predicts all nodes of this tree using a generalization of the method proposed by Herbster et al. (2009a). Our tree prediction procedure is extremely efficient: it only requires constant amortized time per prediction and space linear in the number of nodes.
Note that computational efficiency is a central issue in practical applications, where the involved datasets can be very large. In such contexts, learning algorithms whose time complexity scales, say, more than quadratically with the number of data points should be considered impractical.

A further significant contribution of this work is the experimental evaluation of our method, as compared to methods recently proposed in the literature on graph prediction. In particular, we compare our algorithm to the Perceptron algorithm with Laplacian kernel by Herbster & Pontil (2007); Herbster et al. (2009b), and to the label propagation algorithm by Zhu et al. (2003), including its online version. The experiments have been carried out on four medium-sized real-world datasets. The two tree-based algorithms (ours and the Perceptron algorithm) have been tested using spanning trees generated in various ways. Though not yet conclusive, our experimental comparison shows that our online algorithm compares well to all competitors while using, in most cases, the least amount of time and memory resources.¹

¹ Due to space limitations, all proofs are omitted from this extended abstract. The omitted material can be found in the longer version (Cesa-Bianchi et al., 2010).

2. Preliminaries and basic notation

Let $G = (V, E, W)$ be an undirected, connected, and weighted graph with $n$ nodes and positive edge weights $w_{i,j} > 0$ for $(i, j) \in E$. A labeling of $G$ is any assignment $y = (y_1, \ldots, y_n) \in \{-1, +1\}^n$ of binary labels to its nodes. We use $(G, y)$ to denote the resulting labeled weighted graph.

The online learning protocol for predicting $(G, y)$ is defined as follows. The learner is given $G$ while $y$ is kept hidden. The nodes of $G$ are presented to the learner one by one, according to an unknown and arbitrary permutation $i_1, \ldots, i_n$ of $V$. At each time step $t = 1, \ldots, n$ node $i_t$ is presented and the learner must predict its label $y_{i_t}$. Then $y_{i_t}$ is revealed and the learner knows whether a mistake occurred. The learner's goal is to minimize the total number of prediction mistakes.

It is reasonable to expect that prediction performance should degrade as the amount of "randomness" in the labeling increases. For this reason, our analysis of graph prediction algorithms bounds from above the number of prediction mistakes in terms of appropriate notions of graph label regularity. A standard notion of label regularity is the cutsize of a labeled graph, defined as follows. A $\phi$-edge of a labeled graph $(G, y)$ is any edge $(i, j)$ such that $y_i \neq y_j$. Similarly, an edge $(i, j)$ is $\phi$-free if $y_i = y_j$. Let $E^{\phi} \subseteq E$ be the set of $\phi$-edges in $(G, y)$. The cutsize $\Phi_G(y)$ of $(G, y)$ is the number of $\phi$-edges, i.e., $\Phi_G(y) = |E^{\phi}|$ (independent of the edge weights). The weighted cutsize $\Phi_G^W(y)$ of $(G, y)$ is $\Phi_G^W(y) = \sum_{(i,j) \in E^{\phi}} w_{i,j}$.

Fix $(G, y)$. Let $r^W_{i,j}$ be the effective resistance between nodes $i$ and $j$ of $G$. In the interpretation of the graph as an electric network, where the weights $w_{i,j}$ are the edge conductances, the effective resistance $r^W_{i,j}$ is the voltage between $i$ and $j$ when a unit current flow is maintained through them. For $(i, j) \in E$, let also $p_{i,j} = w_{i,j}\, r^W_{i,j}$ be the probability that $(i, j)$ belongs to a random spanning tree $T$; see, e.g., Lyons & Peres (2009). Then we have

$$\mathbb{E}\,\Phi_T(y) \;=\; \sum_{(i,j) \in E^{\phi}} p_{i,j} \;=\; \sum_{(i,j) \in E^{\phi}} w_{i,j}\, r^W_{i,j}, \qquad (1)$$

where the expectation $\mathbb{E}$ is over the random choice of spanning tree $T$. Since $\sum_{(i,j) \in E} p_{i,j}$ is equal to $n - 1$, irrespective of the edge weighting, we have $0 \le \mathbb{E}\,\Phi_T(y) \le n - 1$. Hence the ratio $\frac{1}{n-1}\,\mathbb{E}\,\Phi_T(y) \in [0, 1]$ provides a density-independent measure of the cutsize in $G$, and even makes it possible to compare labelings on different graphs.

It is also important to note that $\mathbb{E}\,\Phi_T(y)$ can be much smaller than $\Phi_G^W(y)$ when there are strongly connected regions in $G$ contributing prominently to the weighted cutsize. To see this, consider the following scenario: if $(i, j) \in E^{\phi}$ and $w_{i,j}$ is large, then $(i, j)$ gives a big contribution to $\Phi_G^W(y)$.² However, this might not happen in $\mathbb{E}\,\Phi_T(y)$. In fact, if $i$ and $j$ are strongly connected (i.e., if there are many disjoint paths connecting them), then $r^W_{i,j}$ is very small, and the term $w_{i,j}\, r^W_{i,j}$ in (1) is small too. Therefore, the effect of the large weight $w_{i,j}$ may often be compensated by the small probability of including $(i, j)$ in the random spanning tree.

² It is easy to see that in such cases $\Phi_G^W(y)$ can be much larger than $n$.
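As a concrete illustration of the quantities in (1), the sketch below computes the effective resistances $r^W_{i,j}$ from the pseudoinverse of the weighted graph Laplacian, the edge inclusion probabilities $p_{i,j} = w_{i,j} r^W_{i,j}$, and the expected random-spanning-tree cutsize $\mathbb{E}\,\Phi_T(y)$. The function name, the dense Laplacian representation, and the toy triangle graph are our own illustrative assumptions, not part of the paper.

```python
import numpy as np

def expected_rst_cutsize(n, weighted_edges, y):
    """Evaluate E[Phi_T(y)] = sum over phi-edges of w_ij * r^W_ij, as in (1).

    weighted_edges: list of (i, j, w_ij) triples with 0-based node indices.
    y: sequence of +1/-1 labels, one per node.
    """
    # Weighted graph Laplacian L = D - W.
    L = np.zeros((n, n))
    for i, j, w in weighted_edges:
        L[i, i] += w
        L[j, j] += w
        L[i, j] -= w
        L[j, i] -= w
    Lp = np.linalg.pinv(L)  # Moore-Penrose pseudoinverse of the Laplacian

    exp_cutsize, total_p = 0.0, 0.0
    for i, j, w in weighted_edges:
        # Effective resistance: r^W_ij = (e_i - e_j)^T L^+ (e_i - e_j).
        r = Lp[i, i] + Lp[j, j] - 2.0 * Lp[i, j]
        p = w * r                      # probability that (i, j) is in T
        total_p += p
        if y[i] != y[j]:               # (i, j) is a phi-edge
            exp_cutsize += p
    # Sanity check: edge inclusion probabilities always sum to n - 1.
    assert abs(total_p - (n - 1)) < 1e-6
    return exp_cutsize

# Toy example: a weighted triangle with one heavy edge across the cut.
edges = [(0, 1, 10.0), (1, 2, 1.0), (0, 2, 1.0)]
labels = [+1, -1, -1]
print(expected_rst_cutsize(3, edges, labels))   # ~1.48, versus weighted cutsize 11
```

On this toy graph the heavy $\phi$-edge $(0, 1)$ contributes $w_{0,1} r^W_{0,1} \approx 0.95$ to $\mathbb{E}\,\Phi_T(y)$ rather than its full weight 10, a small-scale instance of the compensation effect just described.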
3. A general lower bound

We start by stating a general lower bound, showing that any prediction algorithm must err at least $\frac{1}{2}\,\mathbb{E}\,\Phi_T(y)$ times on any weighted graph.

Theorem 1. Let $G = (V, E, W)$ be a weighted undirected graph with $n$ nodes and weights $w_{i,j} > 0$ for $(i, j) \in E$. Then for all $K \le n$ there exists a randomized labeling $y$ of $G$ such that for all (deterministic or randomized) algorithms $A$, the expected number of prediction mistakes made by $A$ is at least $K/2$, while $\mathbb{E}\,\Phi_T(y) < K$.

4. The Weighted Tree Algorithm

We now describe the Weighted Tree Algorithm (WTA) for predicting the labels of a weighted tree. In Section 5 we show how to apply WTA to the more general weighted graph prediction problem. WTA first turns the tree into a line graph (i.e., a list), and then runs a fast nearest-neighbor method to predict the labels of each node in the line. Though this technique is similar to the one used in (Herbster et al., 2009a), the fact that the tree is weighted makes the analysis significantly more difficult, and the practical scope of our algorithm significantly wider. Our experimental comparison in Section 7 confirms that exploiting the weight information is often beneficial in graph prediction.

Given a labeled weighted tree $(T, y)$, the algorithm initially creates a weighted line graph $L'$ containing some duplicates of the nodes in $T$. Then, each duplicate node (together with its incident edges) is replaced by a single edge with a suitably chosen weight. This results in the final weighted line graph $L$, which is then used for prediction. In order to create $L$ from $T$, WTA performs the following tree linearization steps (a code sketch is given after the list):

1. An arbitrary node $r$ of $T$ is chosen, and a line $L'$ containing only $r$ is created.

2. Starting from $r$, a depth-first visit of $T$ is performed. Each time an edge $(i, j)$ is traversed (even in a backtracking step), the edge is appended to $L'$ with its weight $w_{i,j}$, and $j$ becomes the current terminal node of $L'$. Note that backtracking steps can create in $L'$ at most one duplicate of each edge in $T$, while nodes in $T$ may be duplicated several times in $L'$.

3. $L'$ is traversed once, starting from terminal $r$. During this traversal, duplicate nodes are eliminated as soon as they are encountered. This works as follows. Let $j$ be a duplicate node, and $(j', j)$ and $(j, j'')$ be its two incident edges. The two edges are replaced by a new edge $(j', j'')$ having weight $w_{j',j''} = \min\{w_{j',j},\, w_{j,j''}\}$. Let $L$ be the resulting line. The analysis in Section 4.1 shows that this choice of $w_{j',j''}$ guarantees that the weighted cutsize of $L$ is smaller than twice the weighted cutsize of $T$.
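The following is a minimal sketch of the tree linearization just described. The adjacency-dictionary representation of the tree, the recursive depth-first visit, and the single-pass duplicate-elimination loop are our own implementation assumptions; the paper only specifies the three steps above.

```python
def linearize_tree(adj, root):
    """WTA tree linearization (steps 1-3), as a rough sketch.

    adj: dict mapping each node to a list of (neighbor, weight) pairs
         of an undirected weighted tree.
    Returns the final line L as (nodes, weights), where weights[k] is the
    weight of the edge joining nodes[k] and nodes[k + 1].
    """
    # Steps 1-2: depth-first visit from r, appending every traversed edge to
    # the line L' (backtracking included), so nodes may appear several times.
    line_nodes, line_weights = [root], []
    reached = {root}

    def dfs(u):
        for v, w in adj[u]:
            if v not in reached:
                reached.add(v)
                line_nodes.append(v); line_weights.append(w)   # forward step
                dfs(v)
                line_nodes.append(u); line_weights.append(w)   # backtracking
    dfs(root)

    # Step 3: single scan of L' from terminal r; a duplicate node j with
    # incident edges (j', j), (j, j'') is replaced by the edge (j', j'')
    # of weight min(w_{j',j}, w_{j,j''}).
    nodes, weights, seen = [root], [], {root}
    pending = None   # weight of the (possibly merged) edge awaiting a new node
    for k in range(1, len(line_nodes)):
        w = line_weights[k - 1]
        pending = w if pending is None else min(pending, w)
        v = line_nodes[k]
        if v not in seen:
            seen.add(v)
            nodes.append(v)
            weights.append(pending)
            pending = None
    return nodes, weights
```

Because each tree edge is traversed at most twice, $L'$ has at most $2(n-1)$ edges, and the scan in step 3 keeps exactly one occurrence of each node, so the resulting $L$ has $n$ nodes.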
Once $L$ is created from $T$, the algorithm predicts the label of each node $i_t$ using a nearest-neighbor rule operating on $L$ with a resistance distance metric. That is, the prediction on $i_t$ is the label of $i_s$, where $i_s$ is the previously revealed node closest to $i_t$ in the resistance distance on $L$, i.e., the sum of the edge resistances $1/w_{i,j}$ along the path of $L$ connecting the two nodes.
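The sketch below implements this nearest-neighbor rule in a naive way, purely to illustrate the prediction step: it scans all previously revealed nodes at every round, so it does not achieve the constant amortized time per prediction of the actual procedure. The prefix-sum trick for resistance distances, the default label for the very first node, and all names (including the linearize_tree helper from the previous sketch) are our own assumptions.

```python
def predict_on_line(nodes, weights, reveal_order, labels):
    """Naive nearest-neighbor prediction on the line L with resistance distance.

    nodes, weights: the line L returned by linearize_tree (weights[k] joins
                    nodes[k] and nodes[k + 1]).
    reveal_order: the adversarial order in which nodes are presented.
    labels: dict mapping node -> true label in {-1, +1}.
    Returns the total number of prediction mistakes.
    """
    pos = {v: k for k, v in enumerate(nodes)}
    # Resistance of an edge is 1 / weight; prefix sums give the resistance
    # distance between any two line positions in O(1).
    prefix = [0.0]
    for w in weights:
        prefix.append(prefix[-1] + 1.0 / w)

    def dist(u, v):
        return abs(prefix[pos[u]] - prefix[pos[v]])

    revealed, mistakes = [], 0
    for node in reveal_order:
        if revealed:
            nearest = min(revealed, key=lambda u: dist(u, node))
            y_hat = labels[nearest]
        else:
            y_hat = +1          # arbitrary default: no labels revealed yet
        if y_hat != labels[node]:
            mistakes += 1
        revealed.append(node)
    return mistakes
```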