On learning with kernels for unordered pairs

Martial Hue   Martial.Hue@mines-paristech.fr
Jean-Philippe Vert   Jean-Philippe.Vert@mines-paristech.fr

Mines ParisTech, Centre for Computational Biology, 35 rue Saint Honoré, F-77300 Fontainebleau, France
Institut Curie, F-75248 Paris, France; INSERM U900, F-75248 Paris, France

Appearing in Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 2010. Copyright 2010 by the author(s)/owner(s).

Abstract

We propose and analyze two strategies to learn over unordered pairs with kernels, and provide a common theoretical framework to compare them. The strategies are related to methods that were recently investigated to predict edges in biological networks. We show that the two strategies differ in their loss function and in the kernels they use. We deduce in particular a smooth interpolation between the two approaches, as well as new ways to learn over unordered pairs. The different approaches are tested on the inference of missing edges in two biological networks.

1. Introduction

This work is motivated by a promising line of research that has attracted some interest recently, namely, using machine learning and in particular kernel methods to predict interactions in biological networks from genomic data (Bock & Gough, 2001; Yamanishi et al., 2004; Ben-Hur & Noble, 2005; Vert & Yamanishi, 2005; Kato et al., 2005; Martin et al., 2005; Bleakley et al., 2007). The problem is, given a list of genes or proteins known to interact or not, to predict whether new pairs interact or not. For that purpose, data about the individual proteins are available, such as their amino acid sequence or their expression levels across many experimental conditions. This problem has many applications in systems biology, since it could allow us to systematically increase our knowledge of complex biological processes from the wealth of data that can be produced by recent high-throughput technologies. Learning pairwise relationships also has many possible applications beyond biology, e.g., in sociology and marketing to predict social interactions between persons from data about individuals.

From a machine learning perspective, this problem can be formulated as a binary supervised classification problem over unordered pairs of genes (we only consider undirected interactions here). Let us denote by A the space used to represent the individual proteins, typically a Hilbert space associated to a reproducing kernel. A difficulty that must then be addressed is how to represent an unordered pair of points {a, b} from the representations of the individual points a and b in A. In particular, while an ordered pair (a, b) is naturally an element of the product space A × A, which inherits a Hilbert space structure from that of A, the problem of invariance by permutation must be addressed when unordered pairs {a, b} are considered. Some authors have proposed to use the natural representation of ordered pairs in the product space, either ignoring the invariance constraint (Bock & Gough, 2001) or enforcing it by symmetrizing the representation (Ben-Hur & Noble, 2005; Martin et al., 2005). Others have preferred to rephrase the problem as learning over ordered pairs to estimate a non-symmetric function, and to enforce the invariance in the predictions by averaging a posteriori the predictions made on the symmetric pairs (a, b) and (b, a) to make a prediction for the unordered pair {a, b} (Bleakley et al., 2007).

Here we wish to clarify the theoretical relationships between these last two approaches, i.e., symmetrizing the representation of ordered pairs before learning, or learning over ordered pairs and symmetrizing the predictions a posteriori. We show that, after a certain reformulation, they only differ in the loss function employed and in the representation of ordered pairs they use. This allows us to propose a whole family of new methods, and in particular to smoothly interpolate between both approaches.

Learning over unordered pairs from a description of ordered pairs is a particular instance of multi-instance learning, where one learns over sets of points. Indeed, an unordered pair {a, b} can be seen as a subset of two ordered pairs {(a, b), (b, a)}. Most results in this paper can be generalized to this more general setting of multiple-instance (MI) learning (Dietterich et al., 1997); in particular, one of the two strategies we investigate corresponds to the MI kernel proposed by Gärtner et al. (2002) in the context of MI learning. We therefore state the formalization of the problem and the main results below in the general context of MI learning, and deduce particular results when the equivalence classes are unordered pairs.
2. Setting and notations

Let X be a set endowed with a positive definite (p.d.) kernel K : X × X → R. Much research in statistics and machine learning has been devoted to the inference of a function f : X → R from a set of n observations of input/output pairs S = (x_i, y_i)_{i=1,...,n}, where for each i = 1, ..., n, x_i ∈ X and y_i ∈ R. In particular, a successful line of research has been to estimate a function f in the reproducing kernel Hilbert space (RKHS) H associated with the kernel K as the solution of the following optimization problem:

    \min_{f \in H} \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, f(x_i)) + \lambda \|f\|_{H}^{2} ,        (1)

where ℓ : R² → R is a loss function convex in its second argument, ‖f‖_H is the norm of f in the RKHS H, and λ is a regularization parameter (Schölkopf & Smola, 2002; Shawe-Taylor & Cristianini, 2004).

In this paper we consider the situation where we infer a function not over X, but over a set P of finite subsets of X. This is the typical scenario of MI learning. Each element p ∈ P is a finite subset of |p| points in X, i.e., p = {x^1, ..., x^{|p|}} with x^i ∈ X for i = 1, ..., |p|. We are given a series of n observations of input/output pairs (p_i, y_i)_{i=1,...,n}, where for each i = 1, ..., n, p_i ∈ P and y_i ∈ R. We note that, contrary to the classical situation presented in the previous paragraph, the kernel K is defined on X while the inference problem is on P.

As explained in the introduction, this general MI learning formulation covers in particular the problem of learning over unordered pairs when a kernel over ordered pairs is available: X then represents the set of ordered pairs of the form x = (a, b), i.e., X = A², where A is the set of individuals, while P is the set of pairs of ordered pairs obtained by permutation, i.e., p = {(a, b), (b, a)}. Elements of P can therefore be thought of as unordered pairs. Figure 1 illustrates this representation.

Figure 1. Illustration of our notations. A is the space of individual vertices (genes or proteins). X = A² is the set of ordered pairs of vertices. P is a partition of X, where each element p ∈ P is a set of two symmetric ordered pairs, such as p_1 = {(a, b), (b, a)} and p_2 = {(c, d), (d, c)}.

3. Two strategies to learn over P

We investigate two strategies to learn over P. The first approach is to define a p.d. kernel over P using the one existing over X, and to solve the inference problem over P with a classical kernel method. The second is to transform the inference problem over P into an inference problem over X, and to solve the latter with a classical kernel method using the p.d. kernel K. We now describe each approach in more detail, keeping the general formalism of MI learning.

Strategy 1: Inference over P (pair kernel). The first approach is to derive a kernel K_P over P from the kernel K over X. In the case of pairs, this amounts to defining a kernel over unordered pairs from a kernel over ordered pairs, which we call a symmetric pair kernel. Among the kernels for sets of points that have been investigated in the past, we consider the popular average:

    K_P(p, p') = \frac{1}{|p| \cdot |p'|} \sum_{x \in p, \, x' \in p'} K(x, x') .        (2)

K_P is a particular instance of a convolution kernel (Haussler, 1999) and is therefore p.d. over P. It was proposed as a kernel for MI learning by Gärtner et al. (2002). We denote by H_P its RKHS. We can then estimate a function f : P → R from the examples (p_i, y_i)_{i=1,...,n} by solving the following optimization problem in H_P:

    \hat{f} = \arg\min_{f \in H_P} \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, f(p_i)) + \lambda \|f\|_{H_P}^{2} ,        (3)

where λ is a regularization parameter. This amounts to a classical machine learning method (1) with the MI kernel (2).
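To make Strategy 1 concrete in the unordered-pair case, here is a minimal Python sketch of the average kernel (2), assuming a toy Gaussian kernel K_A between individuals and a simple product kernel over ordered pairs; the function names and the synthetic data are illustrative choices of ours, not part of the original method.

```python
import numpy as np

def k_individual(a, b, gamma=1.0):
    """Toy Gaussian kernel K_A between two individuals (feature vectors)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def k_ordered(x, xp, gamma=1.0):
    """A kernel K over ordered pairs x=(a,b), x'=(c,d); here a simple product kernel."""
    (a, b), (c, d) = x, xp
    return k_individual(a, c, gamma) * k_individual(b, d, gamma)

def k_pair(p, pp, gamma=1.0):
    """Convolution (average) kernel K_P of Eq. (2) between two unordered pairs,
    each represented as the set of its two ordered versions."""
    return sum(k_ordered(x, xp, gamma) for x in p for xp in pp) / (len(p) * len(pp))

def unordered(a, b):
    """Represent the unordered pair {a, b} by its two ordered versions."""
    return [(a, b), (b, a)]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    prots = rng.normal(size=(4, 5))          # 4 individuals described by 5 features
    p1 = unordered(prots[0], prots[1])
    p2 = unordered(prots[2], prots[3])
    print(k_pair(p1, p2))                     # K_P({a,b},{c,d})
    print(k_pair(p1, p1))                     # diagonal entry
```

The Gram matrix obtained by evaluating k_pair on all pairs of training examples can then be handed to any kernel method, e.g., an SVM, to solve (3).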
Strategy 2: Inference over X (pair duplication). The second strategy is to infer a function over X, and to transform it a posteriori into a function that is invariant over each subset in P. For that purpose, we first expand each training example (p, y) of P × Y into several pairs of X × Y by taking all points x ∈ p and assigning them the label y. That is, if p_i = {x_i^1, ..., x_i^{n_i}}, then we expand the training set S over P × Y into a new training set S' over X × Y:

    S' = \{ (x_i^j, y_i), \; i = 1, \ldots, n; \; j = 1, \ldots, n_i \} \subset X \times Y .

We note that when p represents an unordered pair {a, b}, the expansion amounts to duplicating it into the two ordered pairs x = (a, b) and x' = (b, a), both labeled with the same label as p. We then infer a function over X from this training set by solving the problem

    \hat{g} = \arg\min_{g \in H} \frac{1}{n} \sum_{i=1}^{n} L_2(g, p_i, y_i) + \lambda \|g\|_{H}^{2} ,        (4)

where the loss function associated to a subset p ∈ P is expanded over its elements according to:

    L_2(g, p, y) = \frac{1}{|p|} \sum_{x \in p} \ell(y, g(x)) .        (5)

In practice, this means that we simply estimate the function g with a classical kernel learning method (1) over S' (up to a weighting of the loss associated to each point, inversely proportional to the size of the subset p it belongs to). Finally, we obtain a function over P by averaging the values g takes over the elements of each subset, i.e., the final predictor is f_{ĝ} : P → R given by

    \forall p \in P, \quad f_{\hat{g}}(p) = \frac{1}{|p|} \sum_{x \in p} \hat{g}(x) .        (6)
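The following sketch illustrates Strategy 2 on unordered pairs under the same toy setting: each training pair is duplicated into its two orderings, a function g is fitted over ordered pairs, and predictions are averaged as in (6). Kernel ridge regression is used here only as a convenient stand-in for the generic regularized loss (4)-(5); the experiments in this paper use an SVM instead, and all helper names are ours.

```python
import numpy as np

def k_indiv(a, b, gamma=1.0):
    """Toy Gaussian kernel K_A between individuals (illustrative choice)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def k_ordered(x, xp):
    """Kernel K over ordered pairs; here a simple product kernel K_A(a,c)K_A(b,d)."""
    (a, b), (c, d) = x, xp
    return k_indiv(a, c) * k_indiv(b, d)

def duplicate(pairs, labels):
    """Expand each unordered training pair {a,b} into (a,b) and (b,a), both with label y."""
    xs, ys = [], []
    for (a, b), y in zip(pairs, labels):
        xs += [(a, b), (b, a)]
        ys += [y, y]
    return xs, np.array(ys, dtype=float)

def fit_krr(xs, ys, lam=0.1):
    """Kernel ridge regression over the duplicated ordered pairs, standing in for
    the generic regularized estimator (4)-(5)."""
    n = len(xs)
    K = np.array([[k_ordered(x, xp) for xp in xs] for x in xs])
    alpha = np.linalg.solve(K + lam * n * np.eye(n), ys)
    return xs, alpha

def predict_unordered(model, a, b):
    """Final predictor (6): average the predictions on (a,b) and (b,a)."""
    xs, alpha = model
    g = lambda x: sum(al * k_ordered(xp, x) for xp, al in zip(xs, alpha))
    return 0.5 * (g((a, b)) + g((b, a)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    prots = rng.normal(size=(6, 4))
    train_pairs = [(prots[0], prots[1]), (prots[2], prots[3])]
    train_labels = [1.0, -1.0]
    xs, ys = duplicate(train_pairs, train_labels)
    model = fit_krr(xs, ys)
    print(predict_unordered(model, prots[4], prots[5]))
```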
4. Relation to previous work

These two strategies cover several successful approaches that have been proposed to infer undirected interactions in biological networks, which we now review. In this section we therefore consider the case where X = A² is the set of ordered pairs of elements of a set A (e.g., proteins), and P is the set of unordered pairs of the form p = {(a, b), (b, a)} for a, b ∈ A.

4.1. The tensor product pairwise kernel (TPPK) approach

Given a kernel K_A over the set of individuals A, Ben-Hur & Noble (2005) and Martin et al. (2005) define the following tensor product pairwise kernel between unordered pairs:

    K_{TPPK}(\{a, b\}, \{c, d\}) = K_A(a, c) K_A(b, d) + K_A(a, d) K_A(b, c) .        (7)

They then propose to infer a function over unordered pairs P by applying a kernel method over P with the TPPK kernel, i.e., by solving (3) with the kernel (7). We now state the formal equivalence of this approach with both strategies:

Proposition 1. Let X = A² be endowed with the p.d. kernel:

    K((a, b), (c, d)) = 2 K_A(a, c) K_A(b, d) .        (8)

Then the TPPK approach is equivalent to both Strategy 1 and Strategy 2, in the sense that they all give the same solution.

The equivalence between the TPPK approach and Strategy 1 is a simple consequence of the following equality between the TPPK kernel and the kernel K_P defined over P in Strategy 1:

    K_P(\{a, b\}, \{c, d\}) = \frac{1}{4} \left[ K((a, b), (c, d)) + K((a, b), (d, c)) + K((b, a), (c, d)) + K((b, a), (d, c)) \right]
                            = K_A(a, c) K_A(b, d) + K_A(a, d) K_A(b, c)
                            = K_{TPPK}(\{a, b\}, \{c, d\}) .

The equivalence of the TPPK approach with Strategy 2 is less obvious at first sight, and is proved in the Appendix. In practice, this equivalence between the TPPK approach and Strategy 2 does not mean that we should solve the problem with Strategy 2, since the latter involves estimating the classifier from more training examples than the former. The equivalence is nevertheless useful, since it will allow us in Section 6 to propose a smooth interpolation between the TPPK approach and another successful approach for inference over unordered pairs, the local model reviewed below, by interpolating between kernels in Strategy 2.
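The kernel identity stated after Proposition 1 is easy to check numerically. Assuming an illustrative Gaussian K_A, the sketch below compares the TPPK kernel (7) with the average kernel (2) built from the ordered-pair kernel (8); the two values coincide up to floating-point error.

```python
import numpy as np

def k_a(a, b, gamma=1.0):
    """Toy Gaussian kernel K_A (illustrative choice)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def k_tppk(a, b, c, d):
    """TPPK kernel (7) between unordered pairs {a,b} and {c,d}."""
    return k_a(a, c) * k_a(b, d) + k_a(a, d) * k_a(b, c)

def k_ordered(x, xp):
    """Kernel (8) over ordered pairs: K((a,b),(c,d)) = 2 K_A(a,c) K_A(b,d)."""
    (a, b), (c, d) = x, xp
    return 2.0 * k_a(a, c) * k_a(b, d)

def k_pair(a, b, c, d):
    """Average kernel (2) of Strategy 1 with the ordered-pair kernel (8)."""
    p, pp = [(a, b), (b, a)], [(c, d), (d, c)]
    return sum(k_ordered(x, xp) for x in p for xp in pp) / 4.0

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a, b, c, d = rng.normal(size=(4, 3))
    # The identity stated after Proposition 1: K_P = K_TPPK for the kernel (8).
    print(np.isclose(k_pair(a, b, c, d), k_tppk(a, b, c, d)))   # True
```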
4.2. The local model approach

To address the same problem of inferring edges in biological networks, Bleakley et al. (2007) propose to learn local subnetworks independently and to merge them for prediction. More precisely, for each individual a ∈ A, the set of its mates in the unordered pairs of the training set is collected. If a pair {a, b} has the label y in the training set, then the mate b is given the label y. From this set of labeled mates, one estimates a function g_a : A → R with a kernel method (using the kernel K_A on A). g_a then represents the local subnetwork inferred around a, in the sense that g_a(b) > 0 means that b is predicted to interact with a. Finally, once the functions g_a are estimated for all a ∈ A, we obtain a prediction f_local : P → R for a new unordered pair p = {c, d} by averaging the local models in c and d as follows:

    f_{local}(\{c, d\}) = \frac{g_c(d) + g_d(c)}{2} .

Not surprisingly, this local approach is a particular case of Strategy 2 with a particular kernel:

Proposition 2. Let X = A² be endowed with the p.d. kernel:

    K((a, b), (c, d)) = \delta(a, c) \, K_A(b, d) ,        (9)

where δ is the Kronecker kernel (δ(a, c) = 1 if a = c, 0 otherwise). Then the local approach is equivalent to Strategy 2, in the sense that they give the same solution.

The proof of this statement is postponed to the Appendix. It should be noted that, in contrast to the TPPK approach and the kernel (8), Strategy 1 and Strategy 2 are not equivalent with the kernel (9).
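Here is a minimal sketch of the local model approach, assuming individuals are integer indices with associated feature vectors and using kernel ridge regression in place of the kernel method of Bleakley et al. (2007): one local model g_a is trained on the labeled mates of each individual a, and the prediction for a new pair {c, d} averages g_c(d) and g_d(c). All helper names are hypothetical.

```python
import numpy as np
from collections import defaultdict

def k_a(a, b, gamma=1.0):
    """Toy Gaussian kernel K_A between individuals (illustrative choice)."""
    return np.exp(-gamma * np.sum((np.asarray(a) - np.asarray(b)) ** 2))

def fit_local_models(pairs, labels, features, lam=0.1):
    """One local model g_a per individual a, trained on its labeled mates.
    Kernel ridge regression stands in for the kernel method used in the paper."""
    mates = defaultdict(list)                    # individual -> list of (mate, label)
    for (a, b), y in zip(pairs, labels):
        mates[a].append((b, y))
        mates[b].append((a, y))
    models = {}
    for a, lst in mates.items():
        idx = [b for b, _ in lst]
        y = np.array([lab for _, lab in lst], dtype=float)
        K = np.array([[k_a(features[i], features[j]) for j in idx] for i in idx])
        alpha = np.linalg.solve(K + lam * len(idx) * np.eye(len(idx)), y)
        models[a] = (idx, alpha)
    return models

def g(models, features, a, b):
    """Local prediction g_a(b); zero if no model was trained for a."""
    if a not in models:
        return 0.0
    idx, alpha = models[a]
    return float(sum(al * k_a(features[i], features[b]) for i, al in zip(idx, alpha)))

def f_local(models, features, c, d):
    """Prediction for the unordered pair {c, d}: (g_c(d) + g_d(c)) / 2."""
    return 0.5 * (g(models, features, c, d) + g(models, features, d, c))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    features = rng.normal(size=(5, 3))           # 5 individuals, indexed 0..4
    train_pairs = [(0, 1), (0, 2), (1, 3)]       # unordered pairs, by index
    train_labels = [1.0, -1.0, 1.0]
    models = fit_local_models(train_pairs, train_labels, features)
    print(f_local(models, features, 2, 3))
```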
5. Comparison of Strategy 1 and Strategy 2

In this section we come back to the general MI learning setting of learning over a general set of subsets P. While Strategy 1 and Strategy 2 are quite different in their presentation, the following result, whose proof is postponed to the Appendix, shows how to interpret them in a unified framework and highlights their similarity and difference.

Theorem 1. The solution (3) of Strategy 1 is also the solution of Strategy 2 when the loss function (5) is replaced by:

    L_1(g, p, y) = \ell\left( y, \frac{1}{|p|} \sum_{x \in p} g(x) \right) .        (10)

This result shows that the difference between Strategy 1 and Strategy 2 can be thought of as a difference between the loss functions they use. More precisely, learning with the convolution kernel (2) over labeled sets in P is formally equivalent to expanding the labeled sets into labeled points, learning over the labeled points with the loss function (10), and then making a prediction over a set as the average prediction over the points it contains. Strategy 2 follows the same steps, but with a different loss function.

It is instructive to compare the losses L_1 and L_2 used by the two strategies. Since the loss function ℓ(y, u) is convex in its second argument, Jensen's inequality immediately gives:

    L_1(g, p, y) \leq L_2(g, p, y) .

Hence L_2 is a surrogate of L_1, which implies that the solution found by Strategy 2, which has a small average L_2 loss, must also have a small average L_1 loss. Strategy 2 is more conservative than Strategy 1, in the sense that it penalizes a bad prediction for any element of a set p, while Strategy 1 only penalizes the average prediction over p. This suggests that Strategy 2 may be better adapted to situations where one expects all individuals within a subset p to behave similarly, while Strategy 1 may be better suited to problems where only a few elements of a subset are sufficient to characterize the class of the subset, e.g., in multi-instance learning (Gärtner et al., 2002).

Finally, Theorem 1 suggests that new algorithms may be investigated by modifying the loss function (5) employed in Strategy 2, beyond L_1 and L_2. For example, in order to be even more conservative than L_2 and penalize the worst prediction within a subset, one could consider the following loss function:

    L_\infty(g, p, y) = \max_{x \in p} \ell(y, g(x)) .

We leave the investigation of this and other new formulations to future work.

6. Interpolating between the TPPK and local approaches

We have shown that the TPPK approach of Ben-Hur & Noble (2005) and Martin et al. (2005) is equivalent to both Strategy 1 and Strategy 2 with the particular kernel (8) over ordered pairs, while the local approach of Bleakley et al. (2007) is equivalent to Strategy 2 with the kernel (9). While both approaches have been applied separately, this common framework provides opportunities to mix them by smoothly interpolating between them. More precisely, let us consider the following kernel on ordered pairs, for λ ∈ [0, 1]:

    K_\lambda((a, b), (c, d)) = \left[ \lambda \, \delta(a, c) + (1 - \lambda) K_A(a, c) \right] \cdot K_A(b, d) .        (11)

Using K_0 with Strategy 1 or Strategy 2 gives rise to the TPPK approach. Using K_1 with Strategy 2 is the local approach. Varying λ between 0 and 1, and using K_λ with Strategy 2, therefore smoothly interpolates between the TPPK and local approaches. Using K_λ with Strategy 1 for 0 ≤ λ ≤ 1 provides a new set of solutions, starting from the TPPK solution for λ = 0, which combine the less conservative loss function L_1 with a kernel less symmetric than TPPK.

In terms of computational requirements, Strategy 1 requires solving a pattern recognition problem with n training examples, while Strategy 2 solves a similar problem with 2n examples. Strategy 2 with K_1 (the local approach) benefits from an important computational advantage, since the learning problem decomposes into local, uncoupled problems which can be solved separately. If the n training pairs involve m proteins, each present in n/m pairs, then the local model must solve m problems with n/m points each. Since the complexity of learning is typically at least quadratic in the number of points, we typically gain a factor of m over Strategy 1 with other kernels.
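The interpolated kernel (11) is straightforward to implement once a Gram matrix K_A over individuals is available. The sketch below builds, for a given λ, the Gram matrix over unordered pairs used by Strategy 1 (the average (2) over the four orderings) and the 2n × 2n Gram matrix over duplicated ordered pairs used by Strategy 2; identity of protein indices plays the role of the Kronecker δ, and all names are our own.

```python
import numpy as np

def k_lambda(KA, x, xp, lam):
    """Kernel (11) over ordered pairs of protein indices x=(a,b), x'=(c,d):
    K_lambda = [lam * delta(a,c) + (1-lam) * K_A(a,c)] * K_A(b,d)."""
    (a, b), (c, d) = x, xp
    delta = 1.0 if a == c else 0.0
    return (lam * delta + (1.0 - lam) * KA[a, c]) * KA[b, d]

def gram_strategy1(KA, pairs, lam):
    """Gram matrix over unordered pairs for Strategy 1: average (2) of K_lambda
    over the four combinations of orderings."""
    n = len(pairs)
    G = np.zeros((n, n))
    for i, (a, b) in enumerate(pairs):
        for j, (c, d) in enumerate(pairs):
            G[i, j] = np.mean([k_lambda(KA, x, xp, lam)
                               for x in [(a, b), (b, a)]
                               for xp in [(c, d), (d, c)]])
    return G

def gram_strategy2(KA, pairs, lam):
    """Gram matrix over duplicated ordered pairs for Strategy 2 (2n x 2n)."""
    ordered = [o for (a, b) in pairs for o in [(a, b), (b, a)]]
    n = len(ordered)
    G = np.zeros((n, n))
    for i, x in enumerate(ordered):
        for j, xp in enumerate(ordered):
            G[i, j] = k_lambda(KA, x, xp, lam)
    return G

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 3))                                    # 5 proteins, 3 features
    KA = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))   # toy Gaussian K_A
    pairs = [(0, 1), (2, 3), (1, 4)]                               # unordered training pairs
    for lam in (0.0, 0.5, 1.0):                                    # TPPK-like ... local
        G1, G2 = gram_strategy1(KA, pairs, lam), gram_strategy2(KA, pairs, lam)
        print(lam, G1.shape, G2.shape)
```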
7. Experiments

To evaluate the performance of the different formulations, we test them on two benchmark biological networks used in previous studies (Yamanishi et al., 2004; Vert & Yamanishi, 2005; Kato et al., 2005; Bleakley et al., 2007). Both networks represent graphs over proteins of the budding yeast S. cerevisiae. The first network is the metabolic network, which connects pairs of enzymes that catalyse successive reactions in the known biochemical pathways. This graph contains 668 vertices and 2782 undirected edges. The second network is the protein-protein interaction (PPI) network of von Mering et al. (2002), where only high-confidence interactions are kept. This graph contains 984 vertices and 2438 edges.

To predict the edges in both networks, we use several kernels between proteins (vertices), which encode various types of biological information. For the metabolic network, we use three basic kernels based on expression data, cellular localization, and phylogenetic profiles, as well as a kernel that combines them as a sum. For the PPI network, we use four kernels based on expression data, localization, phylogenetic profiles, and yeast two-hybrid experiments, as well as their combination as a sum. All data are available from the supplementary information of Yamanishi et al. (2004), at http://web.kuicr.kyoto-u.ac.jp/~yoshi/ismb04/ .

For each network, we randomly generated nine times as many negative pairs (decoys) as known edges. We then performed a 5-fold cross-validation classification experiment to discriminate true edges from decoys with an SVM, using the libsvm implementation with a custom kernel. For each split, the regularization parameter C of the SVM was chosen on the grid {1, 2, 4, 8, 16, 32} by maximizing the mean area under the ROC curve (AUC) in an internal 5-fold cross-validation on the training set. Using this common procedure, we compared the various kernels K_λ in (11) for 11 values of λ uniformly spaced in the interval [0, 1], and the two approaches Strategy 1 and Strategy 2.

Table 1 shows, for each of the 9 experiments, which configuration reaches the best performance. The main message is that there is no clear winner, and that it may therefore be useful to consider various strategies for a given problem. More precisely, for six datasets Strategy 2 performs as well as or better than Strategy 1 uniformly over λ ∈ [0, 1], while Strategy 1 performs as well as or better than Strategy 2 on the remaining three datasets. In five cases, the maximal AUC is attained at a value of λ strictly between 0 and 1. Among these cases, the improvement over TPPK and the local model is significant for the experiments "interaction, expression" and "metabolic, expression", corresponding to cases where the newly proposed interpolation significantly outperforms both existing methods, with a p-value < 10^-3.

Table 1. Strategy and kernel realizing the maximum mean AUC for the nine metabolic and protein-protein interaction network experiments, with the kernel K_λ for λ ∈ [0, 1].

    benchmark                 best strategy and kernel
    interaction, exp          Duplicate, λ = 0.7
    interaction, loc          Pair kernel, λ = 0.6
    interaction, phy          Duplicate, λ = 0.8
    interaction, y2h          Duplicate / Pair kernel, λ = 0
    interaction, integrated   Duplicate / Pair kernel, λ = 0
    metabolic, exp            Pair kernel, λ = 0.6
    metabolic, loc            Pair kernel, λ = 1
    metabolic, phy            Pair kernel, λ = 0.6
    metabolic, integrated     Duplicate / Pair kernel, λ = 0

Figure 2 shows three of the patterns we observe when we investigate the performance of both strategies as a function of λ. As expected, the performances of Strategy 1 and Strategy 2 coincide at λ = 0, which corresponds to the TPPK approach. The relative performance of both strategies, and their performance as a function of λ, greatly depends on the dataset.

Figure 2. Mean AUC for the "metabolic, loc" (top), "protein-protein interaction, exp" (middle) and "interaction, integrated" (bottom) network experiments, with the kernel K_λ for λ ∈ [0, 1]. The red curves (Duplication) correspond to Strategy 2, while the black curves (Pair kernel) correspond to Strategy 1.
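As a rough sketch of the evaluation protocol described above, the code below runs a 5-fold cross-validation with an SVM on a precomputed pair Gram matrix, selecting C on the grid {1, 2, 4, 8, 16, 32} by an inner 5-fold cross-validation that maximizes mean AUC. We use scikit-learn's SVC with kernel='precomputed' in place of the libsvm custom-kernel interface used in the paper, a synthetic Gram matrix stands in for the benchmark data, and details such as shuffling and stratification are our own assumptions.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

def evaluate(G, y, C_grid=(1, 2, 4, 8, 16, 32), n_splits=5, seed=0):
    """5-fold CV with an SVM on a precomputed Gram matrix G over pairs;
    C is selected on each training fold by an inner 5-fold CV maximizing mean AUC."""
    outer = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    aucs = []
    for tr, te in outer.split(G, y):
        best_C, best_auc = None, -np.inf
        inner = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
        for C in C_grid:
            inner_aucs = []
            for itr, ite in inner.split(G[np.ix_(tr, tr)], y[tr]):
                clf = SVC(C=C, kernel="precomputed")
                clf.fit(G[np.ix_(tr[itr], tr[itr])], y[tr][itr])
                scores = clf.decision_function(G[np.ix_(tr[ite], tr[itr])])
                inner_aucs.append(roc_auc_score(y[tr][ite], scores))
            if np.mean(inner_aucs) > best_auc:
                best_C, best_auc = C, np.mean(inner_aucs)
        clf = SVC(C=best_C, kernel="precomputed")
        clf.fit(G[np.ix_(tr, tr)], y[tr])
        aucs.append(roc_auc_score(y[te], clf.decision_function(G[np.ix_(te, tr)])))
    return float(np.mean(aucs))

if __name__ == "__main__":
    # Synthetic stand-in for a pair Gram matrix and edge/decoy labels.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(80, 10))
    G = X @ X.T                                   # a valid (linear) Gram matrix
    y = (X[:, 0] + 0.3 * rng.normal(size=80) > 0).astype(int)
    print("mean AUC:", evaluate(G, y))
```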
8. Conclusion

We proposed a theoretical analysis of two strategies that were recently proposed to learn over pairs. We have shown that they can be compared in terms of the loss function they minimize and in terms of the kernel they use to represent an ordered pair. We have derived from this analysis a family of new methods which interpolate between the two approaches, and have shown on real data that different strategies can be relevant on different datasets. The problem of automatically finding the right strategy and parameters for a given dataset remains to be investigated, as does the relevance of these methods to more general MI learning problems.

Appendix

Proof of Proposition 1

To show the equivalence between the TPPK approach and Strategy 2 with the product kernel (8), let us first rewrite the TPPK approach as an optimization problem. Let H_A be a Hilbert space and Φ_A : A → H_A be such that, for all a, b ∈ A, K_A(a, b) = ⟨Φ_A(a), Φ_A(b)⟩_{H_A}. Then we can express a product kernel as an inner product in H_2 = H_A ⊗ H_A as follows:

    K_A(a, c) K_A(b, d) = \langle \Phi_1(a, b), \Phi_1(c, d) \rangle_{H_2} ,        (12)

where Φ_1 : X → H_2 is defined by Φ_1(a, b) = Φ_A(a) ⊗ Φ_A(b). For any {a, b} ∈ P, let now:

    \Phi_2(\{a, b\}) = \frac{\Phi_1(a, b) + \Phi_1(b, a)}{\sqrt{2}} .        (13)

By definition of the TPPK kernel (7) we easily see that it can be rewritten as an inner product in H_2:

    K_{TPPK}(\{a, b\}, \{c, d\}) = \langle \Phi_2(\{a, b\}), \Phi_2(\{c, d\}) \rangle_{H_2} .

We therefore deduce that the TPPK method can be expressed as estimating the function f(p) = ⟨W, Φ_2(p)⟩_{H_2} for p ∈ P, where W solves the following optimization problem:

    \min_{W \in H_2} \frac{1}{n} \sum_{i=1}^{n} \ell\left( y_i, \langle W, \Phi_2(p_i) \rangle_{H_2} \right) + \lambda \|W\|_{H_2}^{2} .        (14)

Let us now show that Strategy 2 with the product kernel (8) estimates the same function. Let Φ_0 = √2 Φ_1. Then, by (12), we can express the kernel (8) as:

    2 K_A(a, c) K_A(b, d) = \langle \Phi_0(a, b), \Phi_0(c, d) \rangle_{H_2} .

Therefore we can express the inference in Strategy 2 as an optimization problem in H_2:

    \min_{W \in H_2} \frac{1}{n} \sum_{i=1}^{n} \frac{ \ell\left( y_i, \langle W, \Phi_0(a_i, b_i) \rangle_{H_2} \right) + \ell\left( y_i, \langle W, \Phi_0(b_i, a_i) \rangle_{H_2} \right) }{2} + \lambda \|W\|_{H_2}^{2} .        (15)

For any W ∈ H_2, seen as a Hilbert–Schmidt operator by isometry, let W* ∈ H_2 be its adjoint, characterized by ⟨W, u ⊗ v⟩_{H_2} = ⟨W*, v ⊗ u⟩_{H_2}. This implies, for any a, b ∈ A,

    \langle W, \Phi_0(a, b) \rangle_{H_2} = \langle W^*, \Phi_0(b, a) \rangle_{H_2} .        (16)

We now show that the solution of (15) is self-adjoint, i.e., satisfies W* = W. For this we show that for any W ∈ H_2, (W + W*)/2 reaches a value in the objective function of (15) that is strictly smaller than that of W if W ≠ W*. Indeed, by convexity of ℓ and (16) we first get, for any a, b ∈ A:

    \ell\left( y_i, \left\langle \frac{W + W^*}{2}, \Phi_0(a, b) \right\rangle_{H_2} \right) \leq \frac{ \ell\left( y_i, \langle W, \Phi_0(a, b) \rangle_{H_2} \right) + \ell\left( y_i, \langle W, \Phi_0(b, a) \rangle_{H_2} \right) }{2} .

Plugging this into (15), we see that for any W ∈ H_2 the first term of the objective function is at least as small for (W + W*)/2 as for W. Now, since ‖W*‖_{H_2} = ‖W‖_{H_2}, and by strict convexity of the Hilbert norm, the second term of the objective function is strictly smaller for (W + W*)/2 than for W, except if W = W*. This proves that the solution of (15) is self-adjoint.

Since the solution W of (15) is self-adjoint, it satisfies in particular, for any a, b ∈ A:

    \langle W, \Phi_0(a, b) \rangle_{H_2} = \langle W, \Phi_0(b, a) \rangle_{H_2} = \left\langle W, \frac{\Phi_0(a, b) + \Phi_0(b, a)}{2} \right\rangle_{H_2} = \langle W, \Phi_2(\{a, b\}) \rangle_{H_2} ,        (17)

where the last equality is obtained from the definition Φ_0 = √2 Φ_1 and from (13). Plugging this equality into (15) and rearranging the terms, we see that we recover exactly (14). Therefore the solutions W ∈ H_2 found by the TPPK approach and by Strategy 2 with the kernel (8) are the same. Moreover, (17) shows that the predictions made by the two approaches on any pair are also identical.

Proof of Proposition 2

In Strategy 2 we expand each training unordered pair p = {a, b} into the two ordered pairs (a, b) and (b, a), each labeled with the label of the original unordered pair. Let us rename the elements of the resulting training set as ((u_i, v_i), y_i), i = 1, ..., 2n. Strategy 2 then solves the problem:

    \hat{g} = \arg\min_{g \in H} \frac{1}{2n} \sum_{i=1}^{2n} \ell\left( y_i, g(u_i, v_i) \right) + \lambda \|g\|_{H}^{2} .        (18)

By the representer theorem, the solution ĝ belongs to the following linear subspace of H:

    F = \mathrm{span}\left( \{ K((u_i, v_i), \cdot), \; i = 1, \ldots, 2n \} \right) .

Let now U be the set of values taken by the first members u_i, i = 1, ..., 2n, and for u ∈ U,

    F_u = \mathrm{span}\left( \{ K((u_i, v_i), \cdot) \mid u_i = u, \; i = 1, \ldots, 2n \} \right) .

Since K((u, v), (u', v')) = 0 for u ≠ u', we see that F_u is orthogonal to F_{u'} for u ≠ u', and therefore F = ⊕_{u∈U} F_u. For any g ∈ F, if we denote by g_u its projection onto F_u for u ∈ U, this implies

    \|g\|_{H}^{2} = \sum_{u \in U} \|g_u\|_{H}^{2} .

Moreover, since g_u(u', v') = 0 for u' ≠ u, we also have g(u, v) = g_u(u, v) for any u ∈ U and v ∈ A. This implies that the optimization problem (18) decouples as follows:

    \hat{g} = \arg\min_{g \in F} \sum_{u \in U} \left[ \frac{1}{2n} \sum_{1 \le i \le 2n, \, u_i = u} \ell\left( y_i, g_u(u, v_i) \right) + \lambda \|g_u\|_{H}^{2} \right] .        (19)

Each g_u can be found, independently from the others, by solving the subproblem in brackets using only the training pairs that start with u. Since F_u is isometric to the RKHS H_A of K_A by the mapping g ∈ F_u ↦ h ∈ H_A with h(v) = g_u(u, v), we see that g_u is exactly the function found by the local model approach to infer the local subnetwork around u. Finally, we note that the prediction of Strategy 2 for a new pair {u, v} is given by

    \frac{\hat{g}(u, v) + \hat{g}(v, u)}{2} = \frac{\hat{g}_u(u, v) + \hat{g}_v(v, u)}{2} ,

that is, exactly the same as the prediction of the local model.

Proof of Theorem 1

For any function g : X → R, recall that we denote by f_g the function P → R defined by

    f_g(p) = \frac{1}{|p|} \sum_{x \in p} g(x) .

We first need the following result, which relates the norm in the RKHS H_P of the convolution kernel K_P over P to the norm in the RKHS H of the original kernel K over X.

Lemma 1. For any function f ∈ H_P, it holds that

    \|f\|_{H_P} = \min\left\{ \|g\|_H : g \in H \text{ and } f_g = f \right\} .        (20)

We prove Lemma 1 in the next subsection. Plugging (20) into (3), we observe that the function f ∈ H_P which minimizes the criterion (3) is equal to f_ĝ, where ĝ ∈ H solves the problem:

    \hat{g} = \arg\min_{g \in H} \frac{1}{n} \sum_{i=1}^{n} \ell\left( y_i, f_g(p_i) \right) + \lambda \|g\|_{H}^{2} .        (21)

Since ℓ(y_i, f_g(p_i)) is exactly the loss L_1(g, p_i, y_i) defined in (10), this concludes the proof of Theorem 1.
Proof of Lemma 1

Let Ψ : P → H be defined for any p ∈ P by

    \Psi(p) = \frac{1}{|p|} \sum_{x \in p} K(x, \cdot) ,

where K(x, ·) denotes the function t ↦ K(x, t) in H. For any p, p' ∈ P we have:

    \langle \Psi(p), \Psi(p') \rangle_H = \frac{1}{|p| \cdot |p'|} \sum_{x \in p, \, x' \in p'} \langle K(x, \cdot), K(x', \cdot) \rangle_H = \frac{1}{|p| \cdot |p'|} \sum_{x \in p, \, x' \in p'} K(x, x') = K_P(p, p') .

If we denote by E ⊂ H the closure of the linear subspace of H spanned by {Ψ(p), p ∈ P}, this shows that E (endowed with the inner product inherited from H) is isomorphic to H_P, since these are two Hilbert spaces spanned by P (respectively through Ψ(p) and K_P(p, ·)) whose inner products coincide. Let now f = Σ_i α_i K_P(p_i, ·) be an element of H_P (here the sum may be a convergent series), and let h = Σ_i α_i Ψ(p_i) be its image in E by the isometry between H_P and E. Denoting by Π_E : H → E the orthogonal projection onto E, we obtain by isometry:

    \|f\|_{H_P} = \|h\|_H = \min\left\{ \|g\|_H : g \in H, \; \Pi_E(g) = h \right\} .        (22)

The functions g ∈ H such that Π_E(g) = h are characterized by the equalities ⟨g − h, Ψ(p)⟩ = 0 for all p ∈ P or, equivalently, for all p ∈ P:

    \langle g, \Psi(p) \rangle = \langle h, \Psi(p) \rangle = \sum_i \alpha_i \langle \Psi(p_i), \Psi(p) \rangle = \sum_i \alpha_i K_P(p_i, p) = f(p) .        (23)

On the other hand, by definition of Ψ(p), we also have for all p ∈ P:

    \langle g, \Psi(p) \rangle = \frac{1}{|p|} \sum_{x \in p} \langle g, K(x, \cdot) \rangle = \frac{1}{|p|} \sum_{x \in p} g(x) = f_g(p) .        (24)

Combining (23) and (24), we deduce that g ∈ H satisfies Π_E(g) = h if and only if, for all p ∈ P, f_g(p) = f(p), that is, if and only if f_g = f. Combining this with (22) finishes the proof of Lemma 1.

References

Ben-Hur, A. and Noble, W. S. Kernel methods for predicting protein-protein interactions. Bioinformatics, 21(Suppl. 1):i38–i46, Jun 2005.

Bleakley, K., Biau, G., and Vert, J.-P. Supervised reconstruction of biological networks with local models. Bioinformatics, 23(13):i57–i65, Jul 2007.

Bock, J. R. and Gough, D. A. Predicting protein-protein interactions from primary structure. Bioinformatics, 17(5):455–460, 2001.

Dietterich, T. G., Lathrop, R. H., and Lozano-Perez, T. Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence, 89(1-2):31–71, 1997.

Gärtner, T., Flach, P. A., Kowalczyk, A., and Smola, A. J. Multi-instance kernels. In Sammut, C. and Hoffmann, A. (eds.), Proceedings of the Nineteenth International Conference on Machine Learning, pp. 179–186. Morgan Kaufmann, 2002.

Haussler, D. Convolution kernels on discrete structures. Technical Report UCSC-CRL-99-10, UC Santa Cruz, 1999.

Kato, T., Tsuda, K., and Asai, K. Selective integration of multiple biological data for supervised network inference. Bioinformatics, 21(10):2488–2495, May 2005.

Martin, S., Roe, D., and Faulon, J.-L. Predicting protein-protein interactions using signature products. Bioinformatics, 21(2):218–226, Jan 2005.

Schölkopf, B. and Smola, A. J. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, 2002.

Shawe-Taylor, J. and Cristianini, N. Kernel Methods for Pattern Analysis. Cambridge University Press, New York, NY, USA, 2004.

Vert, J.-P. and Yamanishi, Y. Supervised graph inference. In Saul, L. K., Weiss, Y., and Bottou, L. (eds.), Adv. Neural Inform. Process. Syst., volume 17, pp. 1433–1440. MIT Press, Cambridge, MA, 2005.

von Mering, C., Krause, R., Snel, B., Cornell, M., Oliver, S. G., Fields, S., and Bork, P. Comparative assessment of large-scale data sets of protein-protein interactions. Nature, 417(6887):399–403, May 2002.

Yamanishi, Y., Vert, J.-P., and Kanehisa, M. Protein network inference from multiple genomic data: a supervised approach. Bioinformatics, 20:i363–i370, 2004.