Supervised Bipartite Graph Inference

Yoshihiro Yamanishi
Center for Computational Biology, Mines ParisTech
35 rue Saint-Honore, Fontainebleau, F-77300 France
yoshihiro.yamanishi@ensmp.fr
(The author also belongs to Institut Curie, Paris, F-75248 France and INSERM U900, F-75248 France.)

Abstract

We formulate the problem of bipartite graph inference as a supervised learning problem, and propose a new method to solve it from the viewpoint of distance metric learning. The method involves the learning of two mappings of the heterogeneous objects to a unified Euclidean space representing the network topology of the bipartite graph, where the graph is easy to infer. The algorithm can be formulated as an optimization problem in a reproducing kernel Hilbert space. We report encouraging results on the problem of compound-protein interaction network reconstruction from chemical structure data and genomic sequence data.

1 Introduction

The problem of bipartite graph inference is to predict the presence or absence of edges between heterogeneous objects known to form the vertices of the bipartite graph, based on observations about those objects. This problem is becoming a challenging issue in bioinformatics and computational biology, because many biological networks are represented by a bipartite graph structure, with vertices being heterogeneous molecules and edges being interactions between them. Examples include the compound-protein interaction network consisting of interactions between ligand compounds and target proteins, the metabolic network consisting of interactions between substrates and enzymes, and the host-pathogen protein-protein network consisting of interactions between host proteins and pathogen proteins.

In particular, the prediction of compound-protein interaction networks is a key issue in genomic drug discovery, because drug development depends heavily on the detection of interactions between ligand compounds and target proteins. The human genome sequencing project has made available the sequences of a large number of human proteins, while the high-throughput screening of large-scale chemical compound libraries is enabling us to explore the chemical space of possible compounds [1]. However, our knowledge about such compound-protein interactions is very limited. It is therefore important to detect unknown compound-protein interactions in order to identify potentially useful compounds, such as imaging probes and drug leads, from the huge amount of chemical and genomic data.

A major traditional method for predicting compound-protein interactions is docking simulation [2]. However, docking simulation requires 3D structure information for the target proteins. Most pharmaceutically useful target proteins are membrane proteins such as ion channels and G protein-coupled receptors (GPCRs), and it is still extremely difficult and expensive to determine the 3D structures of membrane proteins, which limits the use of docking. There is therefore a strong incentive to develop new prediction methods that use protein sequences, chemical compound structures, and the available known compound-protein interaction information simultaneously.

Recently, several supervised methods for inferring a simple graph structure (e.g., protein network, enzyme network) have been developed in the framework of kernel methods [3, 4, 5].
The corresponding algorithms are based on kernel canonical correlation analysis [3], distance metric learning [4], and the em-algorithm [5], respectively. However, these methods can only predict edges between homogeneous objects, such as protein-protein interactions and enzyme-enzyme relations; they cannot predict edges between heterogeneous objects, such as compound-protein interactions and substrate-enzyme interactions, because their frameworks are based on a simple graph structure with homogeneous vertices. In contrast, in this paper we address the problem of supervised learning of a bipartite graph rather than a simple graph.

In this contribution, we develop a new supervised method for inferring the bipartite graph, borrowing the idea of distance metric learning used in the framework for inferring the simple graph [4]. The proposed method involves the learning of two mappings of the heterogeneous objects to a unified Euclidean space representing the network topology of the bipartite graph, where the graph is easy to infer. The algorithm can be formulated as an optimization problem in a reproducing kernel Hilbert space. To our knowledge, there are no other statistical methods to predict bipartite graphs from observed data in a supervised context. In the results, we show the usefulness of the proposed method on the problem of compound-protein interaction network reconstruction from chemical structure data and genomic sequence data.

[Figure 1: An illustration of the problem of the supervised bipartite graph inference. The legend distinguishes vertices with attributes 1 and 2 in the known graph, additional vertices with attributes 1 and 2, known edges, and predicted edges.]

2 Formalism of the supervised bipartite graph inference problem

Let us formally define the supervised bipartite graph inference problem. Suppose that we are given an undirected bipartite graph $G = (U + V, E)$, where $U = (u_1, \cdots, u_{n_1})$ and $V = (v_1, \cdots, v_{n_2})$ are sets of heterogeneous vertices and $E \subset (U \times V) \cup (V \times U)$ is a set of edges. Note that the attribute of $U$ is completely different from that of $V$. The problem is, given additional sets of vertices $U' = (u'_1, \cdots, u'_{m_1})$ and $V' = (v'_1, \cdots, v'_{m_2})$, to infer a set of new edges
$$E' \subset U' \times (V + V') \;\cup\; V' \times (U + U') \;\cup\; (U + U') \times V' \;\cup\; (V + V') \times U'$$
involving the additional vertices in $U'$ and $V'$. Figure 1 shows an illustration of this problem.

The prediction of compound-protein interaction networks is a typical problem which fits this framework from a practical viewpoint. In this case, $U$ corresponds to a set of compounds (known ligands), $V$ corresponds to a set of proteins (known targets), and $E$ corresponds to a set of known compound-protein interactions (known ligand-target interactions). $U'$ corresponds to a set of additional compounds (new ligand candidates), $V'$ corresponds to a set of additional proteins (new target candidates), and $E'$ corresponds to a set of unknown compound-protein interactions (potential ligand-target interactions).

The prediction is performed based on available observations about the vertices. The sets of vertices $U = (u_1, \cdots, u_{n_1})$, $V = (v_1, \cdots, v_{n_2})$, $U' = (u'_1, \cdots, u'_{m_1})$ and $V' = (v'_1, \cdots, v'_{m_2})$ are represented by sets of observed data $X = (x_1, \cdots, x_{n_1})$, $Y = (y_1, \cdots, y_{n_2})$, $X' = (x'_1, \cdots, x'_{m_1})$ and $Y' = (y'_1, \cdots, y'_{m_2})$, respectively. For example, compounds are represented by molecular structures and proteins are represented by amino acid sequences. The question is how to predict unknown compound-protein interactions from compound structures and protein sequences using prior knowledge about known compound-protein interactions. The sets $U$ and $V$ ($X$ and $Y$) are referred to as training sets below, and heterogeneous objects are denoted by $u$ and $v$ when viewed as vertices of the bipartite graph, and by $x$ and $y$ when viewed as objects in the observed data.

In order to deal with the data heterogeneity and take advantage of recent work on kernel similarity functions for general data structures [6], we assume that $X$ is a set endowed with a positive definite kernel $k_u$, that is, a symmetric function $k_u : X^2 \to \mathbb{R}$ satisfying $\sum_{i,j=1}^{n_1} a_i a_j k_u(x_i, x_j) \geq 0$ for any $n_1 \in \mathbb{N}$, $(a_1, a_2, \cdots, a_{n_1}) \in \mathbb{R}^{n_1}$ and $(x_1, x_2, \cdots, x_{n_1}) \in X^{n_1}$. Similarly, we assume that $Y$ is a set endowed with a positive definite kernel $k_v$, that is, a symmetric function $k_v : Y^2 \to \mathbb{R}$ satisfying $\sum_{i,j=1}^{n_2} a_i a_j k_v(y_i, y_j) \geq 0$ for any $n_2 \in \mathbb{N}$, $(a_1, a_2, \cdots, a_{n_2}) \in \mathbb{R}^{n_2}$ and $(y_1, y_2, \cdots, y_{n_2}) \in Y^{n_2}$.
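As a concrete anchor for the notation used in the next section, the following minimal sketch (Python/NumPy; the kernel callables stand for any positive definite similarity functions and are our illustrative assumption, not part of the paper) assembles the Gram matrices of $k_u$ and $k_v$ and the bipartite adjacency and degree matrices introduced in Section 3.

```python
import numpy as np

def gram_matrix(objects, kernel):
    """Gram matrix K with K[i, j] = kernel(objects[i], objects[j])."""
    n = len(objects)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel(objects[i], objects[j])
    return K

def bipartite_matrices(edges, n1, n2):
    """Adjacency matrix A_uv (n1 x n2) and diagonal degree matrices D_u, D_v,
    built from a list of known edges (i, j) meaning (u_i, v_j) is in E."""
    A = np.zeros((n1, n2))
    for i, j in edges:
        A[i, j] = 1.0
    D_u = np.diag(A.sum(axis=1))  # degree of each u_i (row sums)
    D_v = np.diag(A.sum(axis=0))  # degree of each v_j (column sums)
    return A, D_u, D_v
```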
3 Distance metric learning (DML) for the bipartite graph inference

3.1 Euclidean embedding and distance metric learning (DML)

Suppose that a bipartite graph must be reconstructed from similarity information about $n_1$ objects $(x_1, \cdots, x_{n_1})$ in $X$ (observed data for $U$) and $n_2$ objects $(y_1, \cdots, y_{n_2})$ in $Y$ (observed data for $V$). One difficulty is that the attributes of the observed data differ in nature between $X$ and $Y$, so it is not possible to evaluate the link between $(x_1, \cdots, x_{n_1})$ and $(y_1, \cdots, y_{n_2})$ from the observed data directly. For example, in the case of compounds and proteins, each $x$ has a chemical graph structure and each $y$ has a sequence structure, so the data structures of $x$ and $y$ are completely different. Therefore, we make the assumption that the $n_1$ objects $(x_1, \cdots, x_{n_1})$ and the $n_2$ objects $(y_1, \cdots, y_{n_2})$ are implicitly embedded in a unified Euclidean space $\mathbb{R}^d$, and a graph is inferred on those heterogeneous points by the nearest neighbor approach, i.e., by putting an edge between heterogeneous points that are close to each other.

We propose the following two-step procedure for the supervised bipartite graph inference (a small sketch of the second step follows the list):

1. embed the heterogeneous objects into a unified Euclidean space representing the network topology of the bipartite graph, where connected heterogeneous vertices are close to each other, through mappings $f : X \to \mathbb{R}^d$ and $g : Y \to \mathbb{R}^d$;

2. apply the mappings $f$ and $g$ to $X'$ and $Y'$ respectively, and predict new edges between heterogeneous objects if the distance between the points $\{f(x), x \in X \cup X'\}$ and $\{g(y), y \in Y \cup Y'\}$ is smaller than a fixed threshold $\delta$.
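The second (fixed) step can be made concrete with a short sketch, assuming the embedded coordinates are already available as arrays; the threshold $\delta$ remains a free parameter.

```python
import numpy as np

def predict_edges(F, G, delta):
    """Connect heterogeneous points that are closer than delta.

    F: (m1, d) array whose rows are embedded points f(x).
    G: (m2, d) array whose rows are embedded points g(y).
    Returns index pairs (i, j) with Euclidean distance ||F[i] - G[j]|| < delta.
    """
    diff = F[:, None, :] - G[None, :, :]      # pairwise differences, (m1, m2, d)
    dist = np.sqrt((diff ** 2).sum(axis=-1))  # Euclidean distance matrix
    return [(i, j) for i, j in zip(*np.where(dist < delta))]
```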
While the second step in this procedure is fixed, the first step can be optimized by supervised learning of $f$ and $g$ using the known bipartite graph. To do so, we require the mappings $f$ and $g$ to map adjacent heterogeneous vertices in the known bipartite graph onto nearby positions in the unified Euclidean space $\mathbb{R}^d$, in order to ensure that the known bipartite graph can be recovered to some extent by the nearest neighbor approach.

Given functions $f : X \to \mathbb{R}$ and $g : Y \to \mathbb{R}$, a possible criterion to assess whether connected (resp. disconnected) heterogeneous vertices are mapped onto similar (resp. dissimilar) points in $\mathbb{R}$ is the following:

$$R(f, g) = \frac{\sum_{(u_i, v_j) \in E} (f(x_i) - g(y_j))^2 \;-\; \sum_{(u_i, v_j) \notin E} (f(x_i) - g(y_j))^2}{\sum_{(u_i, v_j) \in U \times V} (f(x_i) - g(y_j))^2}. \quad (1)$$

A small value of $R(f, g)$ ensures that connected heterogeneous vertices tend to be closer than disconnected heterogeneous vertices in the sense of quadratic error.

To represent the connectivity between heterogeneous vertices of the bipartite graph $G = (U + V, E)$, we define a kind of adjacency matrix $A_{uv}$, whose element $(A_{uv})_{ij}$ is equal to 1 (resp. 0) if vertices $u_i$ and $v_j$ are connected (resp. disconnected). Note that the size of $A_{uv}$ is $n_1 \times n_2$. We also define degree matrices of the heterogeneous vertices, $D_u$ and $D_v$, whose diagonal elements $(D_u)_{ii}$ and $(D_v)_{jj}$ are the degrees of vertices $u_i$ and $v_j$ (the numbers of edges involving $u_i$ and $v_j$), respectively. All non-diagonal elements of $D_u$ and $D_v$ are zero, and the sizes of the matrices are $n_1 \times n_1$ and $n_2 \times n_2$, respectively.

Let us denote by $f_U = (f(x_1), \cdots, f(x_{n_1}))^T \in \mathbb{R}^{n_1}$ and $g_V = (g(y_1), \cdots, g(y_{n_2}))^T \in \mathbb{R}^{n_2}$ the values taken by $f$ and $g$ on the training set. If we restrict $f_U$ and $g_V$ to have zero means, i.e., $\sum_{i=1}^{n_1} f(x_i) = 0$ and $\sum_{i=1}^{n_2} g(y_i) = 0$, then the criterion (1) can be rewritten as follows:

$$R(f, g) = 4 \, \frac{\begin{pmatrix} f_U \\ g_V \end{pmatrix}^T \begin{pmatrix} D_u & -A_{uv} \\ -A_{uv}^T & D_v \end{pmatrix} \begin{pmatrix} f_U \\ g_V \end{pmatrix}}{\begin{pmatrix} f_U \\ g_V \end{pmatrix}^T \begin{pmatrix} f_U \\ g_V \end{pmatrix}} - 2. \quad (2)$$

(The additive and multiplicative constants do not affect the minimizer, so they are dropped in the regularized criterion below.)

To avoid the over-fitting problem and obtain meaningful solutions, we propose to regularize the criterion (1) by a smoothness functional on $f$ and $g$, following a classical approach in statistical learning [7, 8]. We assume that $f$ and $g$ belong to the reproducing kernel Hilbert spaces (r.k.h.s.) $\mathcal{H}_U$ and $\mathcal{H}_V$ defined by the kernels $k_u$ on $X$ and $k_v$ on $Y$, and we use the norms of $f$ and $g$ as regularization operators. Let us denote by $||f||$ and $||g||$ the norms of $f$ and $g$ in $\mathcal{H}_U$ and $\mathcal{H}_V$. Then the regularized criterion to be minimized becomes:

$$R(f, g) = \frac{\begin{pmatrix} f_U \\ g_V \end{pmatrix}^T \begin{pmatrix} D_u & -A_{uv} \\ -A_{uv}^T & D_v \end{pmatrix} \begin{pmatrix} f_U \\ g_V \end{pmatrix} + \lambda_1 ||f||^2 + \lambda_2 ||g||^2}{\begin{pmatrix} f_U \\ g_V \end{pmatrix}^T \begin{pmatrix} f_U \\ g_V \end{pmatrix}}, \quad (3)$$

where $\lambda_1$ and $\lambda_2$ are regularization parameters which control the trade-off between minimizing the original criterion (1) and ensuring that the solution has a small norm in the r.k.h.s. The criterion is defined up to a scaling of the functions, and the solution is therefore a direction in the r.k.h.s. We thus impose the additional normalization constraints $||f|| = ||g|| = 1$, which corresponds to an orthogonal projection onto the direction selected in the r.k.h.s.

Note that this criterion extracts a one-dimensional feature of the objects. In order to obtain a $d$-dimensional feature representation, we propose to iterate the minimization of the regularized criterion (3) under orthogonality constraints in the r.k.h.s., that is, we recursively define the $p$-th features $f_p$ and $g_p$ for $p = 1, \cdots, d$ as follows:

$$(f_p, g_p) = \arg\min \frac{\begin{pmatrix} f_U \\ g_V \end{pmatrix}^T \begin{pmatrix} D_u & -A_{uv} \\ -A_{uv}^T & D_v \end{pmatrix} \begin{pmatrix} f_U \\ g_V \end{pmatrix} + \lambda_1 ||f||^2 + \lambda_2 ||g||^2}{\begin{pmatrix} f_U \\ g_V \end{pmatrix}^T \begin{pmatrix} f_U \\ g_V \end{pmatrix}} \quad (4)$$

under the orthogonality constraints $f \perp f_1, \cdots, f_{p-1}$ and $g \perp g_1, \cdots, g_{p-1}$.

In the prediction process, we map any new objects $x \in X'$ and $y \in Y'$ by the mappings $f$ and $g$ respectively, and predict new edges between heterogeneous objects if the distance between the points $\{f(x), x \in X \cup X'\}$ and $\{g(y), y \in Y \cup Y'\}$ is smaller than a fixed threshold $\delta$.
3.2 Algorithm

Let $k_u$ and $k_v$ be the kernels on the sets $X$ and $Y$, both centered in $\mathcal{H}_U$ and $\mathcal{H}_V$. According to the representer theorem [9] in the r.k.h.s., for any $p = 1, \cdots, d$ the solution of equation (4) has the following expansions:

$$f_p(x) = \sum_{j=1}^{n_1} \alpha_{p,j} \, k_u(x_j, x), \qquad g_p(y) = \sum_{j=1}^{n_2} \beta_{p,j} \, k_v(y_j, y), \quad (5)$$

for some vectors $\alpha_p = (\alpha_{p,1}, \cdots, \alpha_{p,n_1})^T \in \mathbb{R}^{n_1}$ and $\beta_p = (\beta_{p,1}, \cdots, \beta_{p,n_2})^T \in \mathbb{R}^{n_2}$. Let $K_u$ and $K_v$ be the Gram matrices of the kernels $k_u$ and $k_v$, such that $(K_u)_{ij} = k_u(x_i, x_j)$, $i, j = 1, \cdots, n_1$, and $(K_v)_{ij} = k_v(y_i, y_j)$, $i, j = 1, \cdots, n_2$. The corresponding feature vectors $f_{p,U}$ and $g_{p,V}$ can be written as $f_{p,U} = K_u \alpha_p$ and $g_{p,V} = K_v \beta_p$, respectively. The squared norms of the features $f$ and $g$ in $\mathcal{H}_U$ and $\mathcal{H}_V$ are $||f||^2 = \alpha^T K_u \alpha$ and $||g||^2 = \beta^T K_v \beta$, so the normalization constraints for $f$ and $g$ can be written as $\alpha^T K_u \alpha = \beta^T K_v \beta = 1$. The orthogonality constraints $f_p \perp f_q$ and $g_p \perp g_q$ ($p \neq q$) can be written as $\alpha_p^T K_u \alpha_q = 0$ and $\beta_p^T K_v \beta_q = 0$.

Using the above representations, the minimization problem of $R(f, g)$ is equivalent to finding $\alpha$ and $\beta$ which minimize

$$R(f, g) = \frac{\begin{pmatrix} \alpha \\ \beta \end{pmatrix}^T \begin{pmatrix} K_u D_u K_u & -K_u A_{uv} K_v \\ -K_v A_{uv}^T K_u & K_v D_v K_v \end{pmatrix} \begin{pmatrix} \alpha \\ \beta \end{pmatrix} + \lambda_1 \alpha^T K_u \alpha + \lambda_2 \beta^T K_v \beta}{\begin{pmatrix} \alpha \\ \beta \end{pmatrix}^T \begin{pmatrix} K_u K_u & 0 \\ 0 & K_v K_v \end{pmatrix} \begin{pmatrix} \alpha \\ \beta \end{pmatrix}}, \quad (6)$$

under the following orthogonality constraints: $\alpha^T K_u \alpha_1 = \cdots = \alpha^T K_u \alpha_{p-1} = 0$ and $\beta^T K_v \beta_1 = \cdots = \beta^T K_v \beta_{p-1} = 0$.

Taking the differential of equation (6) with respect to $\alpha$ and $\beta$ and setting it to zero, the first vectors $\alpha_1$ and $\beta_1$ are obtained as the eigenvectors associated with the smallest (non-negative) eigenvalue of the following generalized eigenvalue problem:

$$\begin{pmatrix} K_u D_u K_u + \lambda_1 K_u & -K_u A_{uv} K_v \\ -K_v A_{uv}^T K_u & K_v D_v K_v + \lambda_2 K_v \end{pmatrix} \begin{pmatrix} \alpha \\ \beta \end{pmatrix} = \rho \begin{pmatrix} K_u K_u & 0 \\ 0 & K_v K_v \end{pmatrix} \begin{pmatrix} \alpha \\ \beta \end{pmatrix}. \quad (7)$$

Sequentially, the solutions $\alpha_1, \cdots, \alpha_d$ and $\beta_1, \cdots, \beta_d$ are obtained as the eigenvectors associated with the $d$ smallest (non-negative) eigenvalues of this generalized eigenvalue problem.
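Since the whole training procedure reduces to one generalized eigenproblem, a compact sketch is possible (Python with NumPy/SciPy; the small ridge added to the right-hand matrix is our assumption to keep it numerically positive definite, not part of the paper):

```python
import numpy as np
from scipy.linalg import eigh

def dml_fit(Ku, Kv, A, lam1, lam2, d, ridge=1e-8):
    """Solve the generalized eigenvalue problem (7).

    Ku (n1 x n1), Kv (n2 x n2): centered Gram matrices.
    A  (n1 x n2): bipartite adjacency matrix A_uv.
    Returns alpha (n1 x d) and beta (n2 x d), one column per feature p.
    """
    n1, n2 = Ku.shape[0], Kv.shape[0]
    Du = np.diag(A.sum(axis=1))
    Dv = np.diag(A.sum(axis=0))
    # Left-hand block matrix of equation (7); symmetric by construction.
    L = np.block([[Ku @ Du @ Ku + lam1 * Ku, -Ku @ A @ Kv],
                  [-Kv @ A.T @ Ku,           Kv @ Dv @ Kv + lam2 * Kv]])
    # Right-hand block-diagonal matrix; the ridge guards against
    # rank-deficient Gram matrices.
    R = np.block([[Ku @ Ku,            np.zeros((n1, n2))],
                  [np.zeros((n2, n1)), Kv @ Kv]]) + ridge * np.eye(n1 + n2)
    # eigh returns eigenvalues in ascending order: keep the d smallest
    # non-negative ones and split the eigenvectors into alpha and beta.
    vals, vecs = eigh(L, R)
    keep = np.flatnonzero(vals >= 0)[:d]
    return vecs[:n1, keep], vecs[n1:, keep]
```

A new object $x'$ is then mapped through equation (5), i.e. $f_p(x') = \sum_j \alpha_{p,j} k_u(x_j, x')$, by multiplying its vector of kernel evaluations against the training set with the learned coefficient vectors.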
4 Relationship with other methods

The process of embedding heterogeneous objects into the same space is similar to correspondence analysis (CA) [10] and Co-Occurrence Data Embedding (CODE) [11], which are unsupervised methods to embed the rows and columns of a contingency table (the adjacency matrix $A_{uv}$ in this study) into a low-dimensional Euclidean space. However, there are critical differences with our proposed method: i) the above methods cannot use observed data ($X$ and $Y$ in this study) about the heterogeneous nodes for prediction, because their algorithms are based only on co-occurrence information ($A_{uv}$ in this study), and ii) we need to define a new representation not only of the objects in the training set but also of additional objects outside the training set. Therefore, it is not possible to directly apply the above methods to the bipartite graph inference problem.

Recall that the goal of ordinary CA is to find embedding functions $\phi : U \to \mathbb{R}$ and $\psi : V \to \mathbb{R}$ which maximize the following correlation coefficient:

$$\mathrm{corr}(\phi, \psi) = \frac{\sum_{i,j} I\{(u_i, v_j) \in E\} \, \phi(u_i) \, \psi(v_j)}{\sqrt{\sum_i d_{u_i} \phi(u_i)^2} \, \sqrt{\sum_j d_{v_j} \psi(v_j)^2}}, \quad (8)$$

where $I\{\cdot\}$ is an indicator function which returns 1 if the argument is true and 0 otherwise, $d_{u_i}$ (resp. $d_{v_j}$) is the degree of node $u_i$ (resp. $v_j$), and $\sum_i \phi(u_i) = 0$ (resp. $\sum_j \psi(v_j) = 0$) is assumed [10].

Here we consider an extension of CA using the idea of kernel methods, so that it can work in the context of the bipartite graph inference problem. The method is referred to as kernel correspondence analysis (KCA) below. To formulate KCA, we propose to replace the embedding functions $\phi : U \to \mathbb{R}$ and $\psi : V \to \mathbb{R}$ by functions $f : X \to \mathbb{R}$ and $g : Y \to \mathbb{R}$, where $f$ and $g$ belong to the r.k.h.s. $\mathcal{H}_U$ and $\mathcal{H}_V$ defined by the kernels $k_u$ on $X$ and $k_v$ on $Y$. Then, we consider maximizing the following regularized correlation coefficient:

$$\mathrm{corr}(f, g) = \frac{\sum_{i,j} I\{(u_i, v_j) \in E\} \, f(x_i) \, g(y_j)}{\sqrt{\sum_i d_{u_i} f(x_i)^2 + \lambda_1 ||f||^2} \, \sqrt{\sum_j d_{v_j} g(y_j)^2 + \lambda_2 ||g||^2}}, \quad (9)$$

where $\lambda_1$ and $\lambda_2$ are regularization parameters which control the trade-off between maximizing the original correlation coefficient between the two features and ensuring that the solution has a small norm in the r.k.h.s. In order to obtain a $d$-dimensional feature representation and deal with the scale issue, we propose to iterate the maximization of the regularized correlation coefficient (9) under orthogonality constraints in the r.k.h.s., that is, we recursively define the $p$-th features $f_p$ and $g_p$ for $p = 1, \cdots, d$ as $(f_p, g_p) = \arg\max \, \mathrm{corr}(f, g)$ under the orthogonality constraints $f \perp f_1, \cdots, f_{p-1}$ and $g \perp g_1, \cdots, g_{p-1}$ and the normalization constraints $||f|| = ||g|| = 1$.

Using the function expansions in equation (5) and the matrix representations defined in the previous section, the maximization problem of the regularized correlation coefficient in equation (9) is equivalent to finding $\alpha$ and $\beta$ which maximize

$$\mathrm{corr}(f, g) = \frac{\alpha^T K_u A_{uv} K_v \beta}{\sqrt{\alpha^T K_u D_u K_u \alpha + \lambda_1 \alpha^T K_u \alpha} \, \sqrt{\beta^T K_v D_v K_v \beta + \lambda_2 \beta^T K_v \beta}}. \quad (10)$$

Taking the differential of equation (10) with respect to $\alpha$ and $\beta$ and setting it to zero, the first vectors $\alpha_1$ and $\beta_1$ are obtained as the eigenvectors associated with the largest eigenvalue of the following generalized eigenvalue problem:

$$\begin{pmatrix} 0 & K_u A_{uv} K_v \\ K_v A_{uv}^T K_u & 0 \end{pmatrix} \begin{pmatrix} \alpha \\ \beta \end{pmatrix} = \rho \begin{pmatrix} K_u D_u K_u + \lambda_1 K_u & 0 \\ 0 & K_v D_v K_v + \lambda_2 K_v \end{pmatrix} \begin{pmatrix} \alpha \\ \beta \end{pmatrix}. \quad (11)$$

Sequentially, the solutions $\alpha_1, \cdots, \alpha_d$ and $\beta_1, \cdots, \beta_d$ are obtained as the eigenvectors associated with the $d$ largest eigenvalues of this generalized eigenvalue problem.

The final form of KCA is similar to that of kernel canonical correlation analysis (KCCA) [12, 13], so KCA can be regarded as a variant of KCCA. However, the critical differences between KCA and KCCA are as follows: i) in KCCA the objects are the same across the two data sets, while in KCA the objects differ across the two data sets, and ii) KCCA cannot deal with co-occurrence information about the objects. In the experiments below, we are interested in the performance comparison between the distance learning in DML and the correlation maximization in KCA. A similar extension might be possible for CODE as well, but it is out of the scope of this paper.
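Under the same assumptions as the DML sketch in Section 3.2, the KCA features of equation (11) can be computed analogously, this time keeping the $d$ largest eigenvalues:

```python
import numpy as np
from scipy.linalg import eigh

def kca_fit(Ku, Kv, A, lam1, lam2, d, ridge=1e-8):
    """Solve the generalized eigenvalue problem (11) for KCA."""
    n1, n2 = Ku.shape[0], Kv.shape[0]
    Du = np.diag(A.sum(axis=1))
    Dv = np.diag(A.sum(axis=0))
    L = np.block([[np.zeros((n1, n1)), Ku @ A @ Kv],
                  [Kv @ A.T @ Ku,      np.zeros((n2, n2))]])
    R = np.block([[Ku @ Du @ Ku + lam1 * Ku, np.zeros((n1, n2))],
                  [np.zeros((n2, n1)),       Kv @ Dv @ Kv + lam2 * Kv]])
    vals, vecs = eigh(L, R + ridge * np.eye(n1 + n2))  # ascending eigenvalues
    keep = np.argsort(vals)[::-1][:d]                  # take the d largest
    return vecs[:n1, keep], vecs[n1:, keep]
```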
5 Experiment

5.1 Data

In this study we focus on compound-protein interaction networks for four pharmaceutically useful protein classes: enzymes, ion channels, G protein-coupled receptors (GPCRs), and nuclear receptors. The information about compound-protein interactions was obtained from the KEGG BRITE [14], SuperTarget [15] and DrugBank [16] databases. The numbers of known interactions involving enzymes, ion channels, GPCRs, and nuclear receptors are 5449, 3020, 1124, and 199, respectively. The numbers of proteins involved in these interactions are 1062, 242, 134, and 38, and the numbers of compounds are 1123, 475, 417, and 115, respectively. The compound set includes not only drugs but also experimentally confirmed ligand compounds. These data are regarded as gold standard sets to evaluate the prediction performance below. Chemical structures of the compounds and amino acid sequences of the human proteins were obtained from the KEGG database [14].

We computed the kernel similarity values of chemical structures between compounds using the SIMCOMP algorithm [17], where the similarity between two compounds is computed as the Tanimoto coefficient, defined as the ratio of common substructures between the two compounds based on a graph alignment algorithm. We computed the sequence similarities between the proteins using Smith-Waterman scores based on the local alignment between two amino acid sequences [18]. In this study we used the above similarity measures as kernel functions, but the Smith-Waterman scores are not always positive definite, so we added an appropriate multiple of the identity matrix such that the corresponding kernel Gram matrix is positive definite, which is related to the approach of [19]. All kernel matrices were normalized such that all diagonal entries equal one.
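These two kernel preprocessing steps can be sketched as follows; shifting by the most negative eigenvalue is one natural reading of "an appropriate multiple of the identity matrix", and is our assumption rather than the paper's exact recipe.

```python
import numpy as np

def shift_to_positive_definite(S, eps=1e-6):
    """Add a multiple of the identity to a symmetric similarity matrix S
    so that its smallest eigenvalue becomes strictly positive."""
    S = 0.5 * (S + S.T)                    # enforce exact symmetry
    min_eig = np.linalg.eigvalsh(S).min()
    if min_eig < eps:
        S = S + (eps - min_eig) * np.eye(S.shape[0])
    return S

def normalize_unit_diagonal(K):
    """Rescale a Gram matrix so that every diagonal entry equals one:
    K[i, j] <- K[i, j] / sqrt(K[i, i] * K[j, j])."""
    d = np.sqrt(np.diag(K))
    return K / np.outer(d, d)
```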
5.2 Performance evaluation

As a baseline method, we used the nearest neighbor (NN) method, because this idea has been used in traditional molecular screening and in many public databases. Given a new ligand candidate compound, we find the known ligand compound (in the training set) sharing the highest structure similarity with the new compound, and predict the new compound to interact with the proteins known to interact with that nearest ligand compound. Likewise, given a new target candidate protein, we find the known target protein (in the training set) sharing the highest sequence similarity with the new protein, and predict the new protein to interact with the ligand compounds known to interact with that nearest target protein. Newly predicted compound-protein pairs are assigned prediction scores equal to the highest structure or sequence similarity values involving the new compounds or new proteins, in order to draw the ROC curves below.

We tested the three methods, NN, KCA, and DML, on their ability to reconstruct the four compound-protein interaction networks. We performed the following 5-fold cross-validation procedure: the gold standard set was split into 5 subsets of roughly equal size by compounds and proteins, each subset was taken in turn as a test set, and training was performed on the remaining 4 subsets. We drew receiver operating characteristic (ROC) curves, plotting true positives as a function of false positives for varying thresholds $\delta$, where true positives are correctly predicted interactions and false positives are predicted interactions that are not present in the gold standard set. The performance was evaluated by the AUC (area under the ROC curve) score. For KCA and DML, the regularization parameters and the number of features $d$ were optimized by internal cross-validation within the training set, with the AUC score as the target criterion. To obtain robust results, we repeated the above cross-validation experiment five times and computed the average and standard deviation of the resulting AUC scores.
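The scoring step of this evaluation can be sketched as follows (a simplification: fold construction, the internal parameter search, and the four prediction classes are elided; roc_auc_score is scikit-learn's implementation):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_from_distances(dist, labels):
    """ROC-AUC for a set of candidate compound-protein pairs.

    dist:   distances between embedded compounds and proteins (any shape);
            a small distance should indicate an interaction.
    labels: binary gold-standard interactions, same shape as dist.
    Negating the distances turns them into scores, so that sweeping the
    threshold delta traces out the ROC curve.
    """
    return roc_auc_score(np.asarray(labels).ravel(),
                         -np.asarray(dist).ravel())

# One repetition of the experiment then averages AUCs over the folds, e.g.:
# aucs = [auc_from_distances(d_fold, y_fold) for d_fold, y_fold in folds]
# mean, std = np.mean(aucs), np.std(aucs)
```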
Table 1: AUC (ROC) scores for each interaction class, where "train c.", "train p.", "test c.", and "test p." indicate training compounds, training proteins, test compounds, and test proteins, respectively.

Data              Prediction class           NN              KCA             DML
Enzyme            i)   test c. vs train p.   0.655 ± 0.011   0.741 ± 0.011   0.843 ± 0.006
                  ii)  train c. vs test p.   0.758 ± 0.008   0.839 ± 0.009   0.878 ± 0.003
                  iii) test c. vs test p.    0.500 ± 0.000   0.692 ± 0.008   0.782 ± 0.013
                  iv)  all c. vs all p.      0.684 ± 0.006   0.778 ± 0.008   0.852 ± 0.020
Ion channel       i)   test c. vs train p.   0.712 ± 0.004   0.768 ± 0.008   0.800 ± 0.004
                  ii)  train c. vs test p.   0.896 ± 0.008   0.927 ± 0.004   0.945 ± 0.002
                  iii) test c. vs test p.    0.500 ± 0.000   0.748 ± 0.004   0.771 ± 0.008
                  iv)  all c. vs all p.      0.770 ± 0.004   0.838 ± 0.005   0.864 ± 0.002
GPCR              i)   test c. vs train p.   0.714 ± 0.005   0.848 ± 0.002   0.882 ± 0.005
                  ii)  train c. vs test p.   0.781 ± 0.026   0.895 ± 0.025   0.936 ± 0.004
                  iii) test c. vs test p.    0.500 ± 0.000   0.823 ± 0.038   0.864 ± 0.013
                  iv)  all c. vs all p.      0.720 ± 0.013   0.866 ± 0.015   0.904 ± 0.003
Nuclear receptor  i)   test c. vs train p.   0.715 ± 0.009   0.808 ± 0.018   0.832 ± 0.013
                  ii)  train c. vs test p.   0.683 ± 0.010   0.784 ± 0.012   0.812 ± 0.036
                  iii) test c. vs test p.    0.500 ± 0.000   0.670 ± 0.053   0.747 ± 0.049
                  iv)  all c. vs all p.      0.675 ± 0.004   0.784 ± 0.011   0.815 ± 0.024

Table 1 shows the resulting AUC scores for different sets of predictions, depending on whether the compound and/or the protein were in the initial training set. Compounds and proteins in the training set are called training compounds and proteins, whereas those not in the training set are called test compounds and proteins. Four prediction classes are then possible: i) test compounds vs training proteins, ii) training compounds vs test proteins, iii) test compounds vs test proteins, and iv) all possible predictions (the average of the three previous classes).

Comparing the three methods, DML shows the best performance for all four types of compound-protein interaction networks, and outperforms KCA and NN at a significant level. The poor performance of NN implies that raw compound structure or protein sequence similarities do not always reflect the tendency of interaction partners in true compound-protein interaction networks. Among the four prediction classes, predictions where neither the protein nor the compound is in the training set (class iii) are the weakest, but even then reliable predictions are possible with DML. Note that the NN method cannot predict class iii) test vs test interactions, because it depends on template information about known ligand compounds and known target proteins. These results suggest that the feature space learned by DML successfully represents the network topology of the bipartite graph structure of compound-protein networks, while the correlation maximization used in KCA is not enough to reflect this network topology.

6 Concluding remarks

In this paper, we developed a new supervised method to infer a bipartite graph from the viewpoint of distance metric learning (DML). The originality of the proposed method lies in embedding the heterogeneous objects forming the vertices of the bipartite graph into a unified Euclidean space, and in learning the distance between heterogeneous objects with different data structures in that unified feature space. We also discussed the relationship with correspondence analysis (CA) and kernel canonical correlation analysis (KCCA). In the experiments, the proposed DML method outperformed the other methods on the problem of compound-protein interaction network reconstruction from chemical structure and genomic sequence data.

From a practical viewpoint, the proposed method is useful for the virtual screening of the huge numbers of ligand candidate compounds generated by various biological assays and of target candidate proteins, toward genomic drug discovery. It should also be pointed out that the proposed method can be applied to other network prediction problems, such as metabolic network reconstruction, host-pathogen protein-protein interaction prediction, and customer-product recommendation, as long as they can be represented by bipartite graphs.

References

[1] C.M. Dobson. Chemical space and biology. Nature, 432:824-828, 2004.
[2] M. Rarey, B. Kramer, T. Lengauer, and G. Klebe. A fast flexible docking method using an incremental construction algorithm. J Mol Biol, 261:470-489, 1996.
[3] Y. Yamanishi, J.-P. Vert, and M. Kanehisa. Protein network inference from multiple genomic data: a supervised approach. Bioinformatics, 20 Suppl 1:i363-i370, 2004.
[4] J.-P. Vert and Y. Yamanishi. Supervised graph inference. Advances in Neural Information Processing Systems, pages 1433-1440, 2005.
[5] T. Kato, K. Tsuda, and K. Asai. Selective integration of multiple biological data for supervised network inference. Bioinformatics, 21:2488-2495, 2005.
[6] B. Schölkopf, K. Tsuda, and J.-P. Vert. Kernel Methods in Computational Biology. MIT Press, 2004.
[7] G. Wahba. Spline Models for Observational Data. Series in Applied Mathematics, SIAM, Philadelphia, 1990.
[8] F. Girosi, M. Jones, and T. Poggio. Regularization theory and neural networks architectures. Neural Computation, 7:219-269, 1995.
[9] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
[10] M.J. Greenacre. Theory and Applications of Correspondence Analysis. Academic Press, 1984.
[11] A. Globerson, G. Chechik, F. Pereira, and N. Tishby. Euclidean embedding of co-occurrence data. Advances in Neural Information Processing Systems, pages 497-504, 2005.
[12] S. Akaho. A kernel method for canonical correlation analysis. International Meeting of the Psychometric Society (IMPS2001), 2001.
[13] F.R. Bach and M.I. Jordan. Kernel independent component analysis. Journal of Machine Learning Research, 3:1-48, 2002.
[14] M. Kanehisa, S. Goto, M. Hattori, K.F. Aoki-Kinoshita, M. Itoh, S. Kawashima, T. Katayama, M. Araki, and M. Hirakawa. From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res., 34(Database issue):D354-357, Jan 2006.
[15] S. Günther, M. Kuhn, M. Dunkel, et al. SuperTarget and Matador: resources for exploring drug-target relationships. Nucleic Acids Res, 2007.
[16] D.S. Wishart, C. Knox, A.C. Guo, D. Cheng, S. Shrivastava, D. Tzur, B. Gautam, and M. Hassanali. DrugBank: a knowledgebase for drugs, drug actions and drug targets. Nucleic Acids Res, 2007.
[17] M. Hattori, Y. Okuno, S. Goto, and M. Kanehisa. Development of a chemical structure comparison method for integrated analysis of chemical and genomic information in the metabolic pathways. J. Am. Chem. Soc., 125:11853-11865, 2003.
[18] T.F. Smith and M.S. Waterman. Identification of common molecular subsequences. J Mol Biol, 147:195-197, 1981.
[19] H. Saigo, J.-P. Vert, N. Ueda, and T. Akutsu. Protein homology detection using string alignment kernels. Bioinformatics, 20:1682-1689, 2004.