Pacific Symposium on Biocomputing 12:88-99(2007) September 24, 2006 23:36 Proceedings Trim Size: 9in x 6in sridhar AN ITERATIVE ALGORITHM FOR METABOLIC NETWORK-BASED DRUG TARGET IDENTIFICATION PADMAVATI SRIDHAR, TAMER KAHVECI AND SANJAY RANKA Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL, USA, 32611 E-mail: {psridhar, tamer, ranka}@cise.ufl.edu Post-genomic advances in bioinformatics have refined drug-design strategies, by focusing on the reduction of serious side-effects through the identification of enzymatic targets. We consider the problem of identifying the enzymes (i.e., drug targets), whose inhibition will stop the production of a given target set of compounds, while eliminating minimal number of non-target compounds. An exhaustive evaluation of all possible enzyme combinations to find the optimal solution subset may become computationally infeasible for very large metabolic networks. We propose a scalable iterative algorithm which computes a sub-optimal solution within reasonable time-bounds. Our algorithm is based on the intuition that we can arrive at a solution close to the optimal one by tracing backward from the target compounds. It evaluates immediate precursors of the target compounds and iteratively moves backwards to identify the enzymes whose inhibition will stop the production of the target compounds while incurring minimum side-effects. We show that our algorithm converges to a sub-optimal solution within a finite number of such iterations. Our experiments on the E.Coli metabolic network show that the average accuracy of our method deviates from that of the exhaustive search only by 0.02 % . Our iterative algorithm is highly scalable. It can solve the problem for the entire metabolic network of Escherichia Coli in less than 10 seconds. 1. Introduction Traditional drug development approaches focused more on the efficacy of drugs than their toxicity (untoward side effects). Lack of predictive models that account for the complexity of the inter-relationships between the metabolic processes often leads to drug development failures. Toxicity and/or lack of efficacy can result if metabolic network components other than the intended target are affected. This is well-illustrated by the example of the recent failure of Tolcapone (a new drug developed for Parkinson's disease) due to observed hepatic toxicity in some patients 9 . Post-genomic drug research focuses more on the identification of specific biological targets (gene products, such as enzymes or proteins) for drugs, which can be manipulated to produce the desired effect (of curing a disease) with minimum disruptive side-effects 20,24 . The main phases in such a drug development strategy are target identification, validation and lead inhibitor identification 7 . Work supported partially by ORAU (Award no: 00060845). The work of Sanjay Ranka is supported in part by the National Science Foundation under Grant ITR 0325459. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. 1 Pacific Symposium on Biocomputing 12:88-99(2007) September 24, 2006 23:36 Proceedings Trim Size: 9in x 6in sridhar Enzymes catalyze reactions, which produce metabolites (compounds) in the metabolic networks of organisms. Enzyme malfunctions that result in the accumulation of certain compounds may result in diseases. We term such compounds as Target Compounds and the remaining compounds as Non-Target compounds. For instance, the malfunction of enzyme phenylalanine hydroxylase causes buildup of the amino acid, phenylalanine, resulting in phenylketonuria 23 , a disease that causes mental retardation. Hence, it is intuitive to identify the optimal enzyme set that can be manipulated by drugs to prevent the excess production of target compounds, with minimal side-effects. We term the side-effects of inhibiting a certain enzyme combination as the damage caused to the metabolic network. Formally, we define damage of inhibiting an enzyme as the number of non-target compounds whose production is stopped by the inhibition of that specific enzyme. In our earlier work 22 , we developed a graph model for metabolic networks based on the boolean network model 21 . In our model, R, C , and E denote the set of reactions, compounds, and enzymes respectively. The node set consists of all the members of R C E . A node is labeled as reaction, compound, or enzyme based on the entity it refers to. Edges represent the interactions in the network. A directed edge from vertex x to vertex y is drawn if one of the following three conditions holds: (1) x represents an enzyme that catalyzes the reaction represented by y . (2) x corresponds to a reactant for the reaction represented by y . (3) x represents a reaction that produces the compound mapped to y . We assume that the inputs to all reactions and compounds are already present in the network and that there are no external inputs. Figure 1(a) illustrates a small hypothetical metabolic network. A directed edge from an enzyme to a reaction implies that the enzyme catalyzes that reaction. For instance, E1 catalyzes R1 and R2 . A directed edge from a compound to a reaction implies that the compound is a reactant (input compound). A directed edge from a reaction to a compound implies that the compound is a product (output compound). In this figure, C1 is the target compound (i.e., the production of C1 should be stopped). In order to stop its production, we have to prevent R1 from taking place. This can be accomplished in two ways: (1) By disrupting one of its catalyzing enzymes (E1 in this case). Figure 1(b) shows the effects of disrupting E1 . The resulting damage is calculated as the number of non-target compounds whose production is stopped. Since the production of C2 , C3 and C4 is stopped, the damage due to the disruption of E1 is 3. (2) By stopping the production of one of its reactant compounds (C5 in this case). To stop the production of C5 , we need to recursively look for the enzyme combination which is indirectly responsible for its production (E2 and E3 ). The combined damage of E2 and E3 is 1. Thus, the production of the target compound can be stopped by manipulating either E1 or a combination of E2 and E3 . The optimal solution is the enzyme combination whose disruption has the minimum damage on the network (E2 and E3 in this case). Problem: Given a large metabolic network and a set of target compounds, we con- Pacific Symposium on Biocomputing 12:88-99(2007) September 24, 2006 23:36 Proceedings Trim Size: 9in x 6in sridhar E2 R3 C1 C5 R4 R1 E2 R3 C5 R1 C1 E3 E3 R4 C2 E1 R2 C3 C4 E1 R2 C2 C3 C4 (a) (b) Figure 1. (a)A graph constructed for a metabolic network with four reactions R1 , · · · , R4 , three enzymes E1 , E2 and E3 , and five compounds C1 , · · · , C5 . Circles, rectangles, and triangles denote compounds, reactions, and enzymes respectively. Here, C1 (shown by double circle) is the target compound. (b)Effect of inhibiting E1 . Dotted lines indicate the subgraph removed due to inhibition of enzyme E1 . sider the problem of identifying the set of enzymes whose inhibition eliminates all the target compounds and inflicts minimum damage on the rest of the network. Evaluating all enzyme combinations is not feasible for metabolic networks with a large number of enzymes. This is because, the number of enzyme combinations, i.e., 2|E | - 1, increases exponentially with the number of enzymes. Efficient and precise heuristics are needed to tackle this problem. Note that different enzymes and compounds may have varying levels of importance in the metabolic network. Our model simplistically considers all the enzymes and compounds to be of equal importance. It can be extended by assigning weights to enzymes and compounds based on their roles in the network. However, we do not discuss these extensions in this paper. Contribution: In this paper, we develop a scalable iterative algorithm as an approximation to the optimal enzyme combination detection problem. Our algorithm is based on the intuition that we can arrive at a solution close to the optimal one by tracing backward from the target compounds. It starts by finding the damage incurred due to the removal of each reaction or compound by evaluating its immediate precursors. It then iteratively improves the damage by considering the damage computed for the immediate precursors. It converges when the damage values cannot be improved any further. We prove that the number of iterations is at most the number of reactions on the longest path from any enzyme to the target compounds in the underlying pathway. To the best of our knowledge, this is the first polynomial time solution for a metabolic-network based drug target identification problem. The rest of the paper is organized as follows. Section 2 discusses the related work. Section 3 presents the proposed iterative algorithm for determining the enzyme combination whose inhibition achieves the desired effect of inhibiting the production of target compounds. Section 4 presents a theoretical analysis of the algorithm. Section 5 discusses experimental results. Section 6 concludes the paper. Pacific Symposium on Biocomputing 12:88-99(2007) September 24, 2006 23:36 Proceedings Trim Size: 9in x 6in sridhar 2. Related work Traditional pharmacological drug discovery approaches involve the incorporation of a large number of hypothetical targets into cell-based assays and automated high throughput screening (HTS) of vast chemical compound libraries7 . Post-genomic advances in bioinformatics have fostered the development of rational drug-design strategies, that seek to reduce serious side-effects8,4,3 . This era has brought about the concept of reverse pharmacology, in which, the first step is the identification of protein targets, that may be critical intervention points in a disease process24,20,1 . Since this method is driven by the mechanics of the disease, it is expected to be more efficient than the classical approach24 . Rapid identification of enzyme (or protein) targets needs a thorough understanding of the underlying metabolic network of the organism affected by a disease. The availability of fully sequenced genomes has enabled researchers to integrate the available genomic information to reconstruct and study metabolic networks17 . These studies have revealed important properties of metabolic networks10,2,15 . The potential of an enzyme to be an effective drug target is considered to be related to its essentiality in the corresponding metabolic network13 . Lemke et. al proposed the measure enzyme damage as an indicator of enzyme essentiality 14,16 . Recently, a computational approach to prioritize potential drug targets for antimalarial drugs was developed 18 . A choke-point analysis of P.falciparcum was performed to identify essential enzymes which are potential drug targets. The possibility of using enzyme inhibitors as antiparasitic drugs is being investigated through stoichiometric analysis of the metabolic networks of parasites 5,6 . These studies show the effectiveness of computational techniques in reverse pharmacology. A combination of gene-knockout and micro-array time-course data was used to study the effects of a chemical compound on a gene network 12 . An investigation of metabolite essentiality was carried out with the help of stoichiometric analysis 11 . These approaches underline the importance of studying the role of compounds (metabolites) during the pursuit of computational solutions to pharmacological problems. 3. Iterative algorithm In this section, we develop a scalable iterative algorithm that finds a sub-optimal solution to the enzyme-target identification problem quickly. Our algorithm is based on the intuition that we can arrive at a solution close to the optimal one, by tracing backwards from the target compounds. We evaluate the immediate precursors of the target compounds and iteratively move backwards to identify the enzymes, whose inhibition will stop the production of the target compounds while incurring minimum damage. Our algorithm consists of an initialization step followed by iterations, until some convergence criteria is satisfied. Let E , R and C denote the sets of enzymes, reactions and compounds of the metabolic network respectively. Let |E |, |R| and |C | denote the number of enzymes, reactions and compounds respectively. Pacific Symposium on Biocomputing 12:88-99(2007) September 24, 2006 23:36 Proceedings Trim Size: 9in x 6in sridhar The primary data structures are three vectors, namely an enzyme vector VE = [e1 , e2 , · · · , e|E | ], a reaction vector VR = [r1 , r2 , · · · , r|R| ], and a compound vector VC = [c1 , c2 , · · · , c|C | ]. Each value, ei , in VE denotes the damage of inhibition of enzyme, Ei E . Each value, ri , in VR denotes the damage incurred by stopping the reaction Ri R. Each value, ci , in VC denotes the damage incurred by stopping the production of the compound Ci C . Initialization: Here, we describe the initialization of vectors VE , VR , and VC . We initialize VE first, VR second, and VC last. Enzyme vector: The damage ei , i, 1 i |E |, is computed as the number of nontarget compounds whose productions stop after inhibiting Ei . We find the number of such compounds by doing a breadth-first traversal of the metabolic network starting from Ei . We calculate the damage ei associated with every enzyme Ei E , i, 1 i |E |, and store it at position i in the enzyme vector VE . Reaction vector: The damage rj is computed as the minimum of the damages of the enzymes that catalyze Rj , j, 1 j |R|. In other words, let E1 , E2 , · · · , Ek be the enzymes that catalyze Rj . We compute the damage of rj as rj = mink=1 {ei }. This computation is intuitive since a reaction can be disrupted by i inhibiting any of its catalyzers. We calculate rj associated with every reaction Rj R, j, 1 j |R| and store it at position j in the reaction vector VR . Let E (Rj ) denote the set of enzymes that produced the damage rj . Along with rj , we also store E (Rj ). Note that in our model, we do not consider back-up enzyme activities for simplicity. Compound vector: The damage ck , k , 1 k |C |, is computed by considering the reactions that produce Ck . Let R1 , R2 , · · · , Rj be the reactions that produce Ck . We first compute a set of enzymes E (Ck ) for Ck as E (Ck ) = E (R1 ) E (R2 )· · ·E (Rj ). We then compute the damage value ck as the number of nontarget compounds that is deleted after the inhibition of all the enzymes in E (Ck ). This computation is based on the observation that a compound disappears from the system only if all the reactions that produce it stop. We calculate ck associated with every compound Ck C , 1 k |C | and store it at position k in the compound vector VC . Along with ck , we also store E (Ck ). Column I0 in Table 1 shows the initialization of the vectors for the network in Figure 1. The damage e1 of E1 is three, as inhibiting E1 stops the production of three non-target compounds C2 , C3 and C4 . Since the disruption of E2 or E3 alone does not stop the production of any non-target compound, their damage values are zero. Hence, VE = [3, 0, 0]. The damage values for reactions are computed as the minimum of their catalyzers (r1 = r2 = e1 and r3 = r4 = e2 ). Hence, VR =[3, 3, 0, 0]. The damage values for compounds are computed from the reactions that produce them. For instance, R1 and R2 produce C2 . E (R1 ) = E (R2 ) = {E1 }. Therefore, c2 = e1 . Similarly c5 is equal to the damage of inhibiting the set E (R3 ) E (R4 ) = {E2 , E3 }. Thus, c5 = 1. Iterative steps: We iteratively refine the damage values in vectors VR and VC in a Pacific Symposium on Biocomputing 12:88-99(2007) September 24, 2006 23:36 Proceedings Trim Size: 9in x 6in sridhar Table 1. Iterative Steps: I0 is the initialization step; I1 and I2 are the iterations. VR and VC represent the damage values of reactions and compounds respectively computed at each iteration. VE = [3, 0, 0] in all iterations. I0 VR , VC [3, 3, 0, 0], [3, 3, 3, 3, 1] I1 [1, 3, 0, 0], [1, 3, 3, 3, 1] I2 [1, 3, 0, 0], [1, 3, 3, 3, 1] number of steps. At each iteration, the values are updated by considering the damage of the precursor of the precursors. Thus, at nth iteration, the precursors from which a reaction or a compound is reachable on a path of length up to n are considered. We define the length of a path on the graph constructed for a metabolic network as the number of reactions on that path (see Definition 4.2). There is no need to update VE since the enzymes are not affected by the reactions or the compounds. Next, we describe the actions taken to update VR and VC at each iteration. We later discuss the stopping criteria for the iterations. Reaction vector: Let C1 , C2 , · · · , Ct be the compounds that are input to Rj . We update the damage of rj as rj = min{rj , mint=1 {ci }}. i The first term of the min function denotes the damage value calculated for Rj during the previous iteration. The second term provides the damage of the input compound with the minimum damage found in the previous iteration. This computation is intuitive since a reaction can be disrupted by stopping the production of any of its input compounds. The damage of all the input compounds are already computed in the previous iteration (say (n - 1)th iteration). Therefore, at iteration n, the second term of the min function considers the impact of the reactions and compounds that are away from Rj by n edges in the graph for the metabolic network. Let E (Rj ) denote the set that contains the enzymes that produced the new damage rj . Along with rj , we also store E (Rj ). We update all rj VR using the same strategy. Note that the values rj can be updated in any order, i.e., the result does not depend on the order in which they are updated. Compound vector: The damage ck , k , 1 k |C |, is updated by considering the damage computed for Ck in the previous iteration and the damages of the reactions that produce Ck . Let R1 , R2 , · · · , Rj be the reactions that produce Ck . We first compute a set of enzymes as E (R1 ) E (R2 ) · · · E (Rj ). Here, E (Rt ), 1 t j , is the set of enzymes computed for Rt after the reaction vector is updated in the current iteration. We then update the damage value ck as ck = j min{ck , damage( i=1 E (Ri ))}. The first term here denotes the damage value computed for Ck in the previous iteration. The second term shows the damage computed for all the precursor reactions in the current step. Along with ck , we also store E (Ck ), the set of enzymes which provides the current minimum damage ck . Condition for convergence: At each iteration, each value in VR and VC either remains the same or decreases by an integer amount. This is because a min function Pacific Symposium on Biocomputing 12:88-99(2007) September 24, 2006 23:36 Proceedings Trim Size: 9in x 6in sridhar is applied to update each value as the minimum of the current value and a function of its precursors. Therefore, the values of VR and VC do not increase. Furthermore, a damage value is always an integer since it denotes the number of deleted nontarget compounds. We stop our iterative refinement steps when the vectors VR and VC do not change in two consecutive iterations. This is justified, because, if these two vectors remain the same after an iteration, it implies that the damage values in VR and VC cannot be minimized any more using our refinement strategy. Columns I1 and I2 in Table 1 show the iterative steps to update the values of the vectors VR and VC . In I0 , we compute the damage r1 for R1 as the minimum of its current damage (three) and the damage of its precursor compound, c5 = 1. Hence, r1 is updated to 1 and its associated enzyme set is changed to {E2 , E3 }. The other values in VR remain the same. When we compute the values for VC , c1 is updated to 1, as its new associated enzyme set is {E2 , E3 } and the damage of inhibiting both E2 and E3 together is 1. Hence, VR = [1, 3, 0, 0] and VC = [1, 3, 3, 3, 1]. In I2 , we find that the values in VR and VC do not change anymore. Hence, we stop our iterative refinement and report the enzyme combination E2 , E3 as the iterative solution for stopping the production of the target compound, C1 . Complexity analysis: Space Complexity: The number of elements in the reaction and compound vectors is (|R| + |C |). For each element, we store an associated set of enzymes. Hence, the space complexity is O((|R| + |C |) |E |). Time Complexity: The number of iterations of the algorithm is O(|R|) (see Section 4). The computational time per iteration is O(G (|R| + |C |)), where G is the size of the graph. Hence, the time complexity is O(|R|G (|R| + |C |)). 4. Maximum number of iterations In this section, we present a theoretical analysis of our proposed algorithm. We show that the number of iterations for the method to converge is finite. This is because the number of iterations is dependent on the length of the longest non-selfintersecting path (see Definitions below) from any enzyme to a reaction or compound. Definition 4.1. In a given metabolic network, a non-self-intersecting path is a path which traces any vertex on the path exactly once. For simplicity, we will use the term path instead of non-self-intersecting path in the rest of this section. Definition 4.2. In a given metabolic network, the length of a path from an enzyme Ei to a reaction Rj or compound Ck is defined as the number of unique reactions on that path. Note that the reaction Rj is counted as one of the unique reactions on the path from enzyme Ei to Rj . Pacific Symposium on Biocomputing 12:88-99(2007) September 24, 2006 23:36 Proceedings Trim Size: 9in x 6in sridhar Definition 4.3. In a given metabolic network, the preceding path of a reaction Rj (or a compound Ck ) is defined as the length of the longest path from any enzyme in that network to Rj (or Ck ). Theorem 4.1. Let VE = [e1 , e2 , · · · , e|E | ], VR = [r1 , r2 , · · · , r|R| ], and VC = [c1 , c2 , · · · , c|C | ] be the enzyme, reaction and compound vectors respectively (see Section 3). Let n be the length of the longest path (see Definitions 4.2 and 4.1) from any enzyme Ei to a reaction Rj (or a compound Ck ). The value rj (or ck ) remains constant after at most n iterations. Proof: We prove this theorem by an induction on the number of reactions on the longest path (see Definitions 4.2 and 4.1) from any enzyme in Ei corresponding to ei VE to Ck . Basis: The basis is the case when the longest path from an enzyme Ei is of length 1 (i.e., the path consists of exactly one reaction). Let Rj be such a reaction. This implies that there is no other reaction on a path from any Ei to Rj . As a result, the value rj remains constant after initialization. Let Ck be a compound such that there is at most one reaction from any enzyme to Ck . Let R1 , R2 , · · · , Rj be the reactions that produce Ck . Because of our assumption there is no precursor reaction to any of these reactions. Otherwise, the length of the longest path would be greater than one. Therefore, the values r1 , r2 , · · · , rj and the sets E (R1 ), E (R2 ), · · · , E (Rj ) do not change after initialization. The value ck is computed as the damage of E (Ck ) = E (R1 ) E (R2 ) · · · E (Rj ). Thus, ck remains unchanged after initialization and the algorithm terminates after the first iteration. Inductive step: Assume that the theorem is true for reactions and compounds that have a preceding path with at most n - 1 reactions. Now, we will prove the theorem for reactions and compounds that have a preceding path with n reactions. Assume that Rj and Ck denote such a reaction and a compound. We will prove the theorem for each one separately. Proof for Rj : Let C1 , C2 , · · · , Ct be the compounds that are input to Rj . The preceding path length of each of these input compounds, say Cs is at most n. Otherwise, the preceding path length of Rj would be greater than n. Case 1: If the preceding path length of Cs is less than n, by our induction hypothesis, cs would remain constant after (n - 1)th iteration. Thus, the input compound Cs will not change the value of rj after nth iteration. Case 2: If the preceding path length of Cs is n, then Rj is one of the reactions on this path. In other words, Cs and Rj are on a cycle of length n. Otherwise, the preceding path length of Rj would be greater than n. Recall that at each iteration, the algorithm considers a new reaction or a compound on the preceding path starting from the closest one. Thus, at nth iteration of computation of rj , the algorithm completes the cycle and considers Rj . This however will not modify rj . This is because the value of rj monotonically decreases (or remains the same) at each iteration. Thus, the initial damage value computed from Rj is guaranteed to be no Pacific Symposium on Biocomputing 12:88-99(2007) September 24, 2006 23:36 Proceedings Trim Size: 9in x 6in sridhar better than rj after n - 1 iterations. We conclude that rj will remain unchanged after nth iteration. Proof for Ck : Let R1 , R2 , · · · , Rj be the reactions that produce Ck . The preceding path length of each of these reactions, say Rs is at most n. Otherwise, the preceding path length of Ck would be greater than n. Case 1: If the preceding path length of Rs is less than n, by our induction hypothesis rs would remain constant after (n - 1)th iteration. Thus, the reaction Rs will not change the value of ck after nth iteration. Case 2: If the preceding path length of Rs is n, then from our earlier discussion for proof of Rj , rs remains unchanged after nth iteration. Therefore Rs will not change the value of ck after nth iteration. Hence, by induction, we show that the Theorem 4.1 holds. 5. Experimental results We evaluate our proposed iterative algorithm using the following three criteria: Execution time: The total time (in milliseconds) taken by the method to finish execution and report if a feasible solution is identified or not. Number of iterations: The number of iterations performed by the method to arrive at a steady-state solution. Average damage: The average number of non-target compounds that are eliminated when the enzymes in the result set are inhibited. We extracted the metabolic network information of Escherichia Coli (E.Coli) from KEGG 19 (ftp://ftp.genome.jp/pub/kegg/pathways/eco/). The metabolic network in KEGG has been hierarchically classified into smaller networks according to their functionality. We performed experiments at different levels of hierarchy of the metabolic network and on the entire metabolic network, that is an aggregation of all the functional subnetworks. We devised a uniform labeling scheme for the networks based on the number of enzymes. According to this scheme, a network label begins with `N' and is followed by the number of enzymes in the network. For instance, `N20' indicates a network with 20 enzymes. Table 2 shows the metabolic networks chosen, along with their identifiers and the number of compounds (C), reactions (R) and edges (Ed). The edges represent the interactions in the network. For each network, we constructed query sets of sizes one, two and four target compounds, by randomly choosing compounds from that network. Each query set contains 10 queries each. We implemented the proposed iterative algorithm and an exhaustive search algorithm which determines the optimal enzyme combination to eliminate the given set of target compounds with minimum damage. We implemented the algorithms in Java. We ran our experiments on an Intel Pentium 4 processor with 2.8 GHz clock speed and 1-GB main memory, running Linux operating system. Evaluation of Accuracy: Table 3 shows the comparison of the average damage values of the solutions computed by the iterative algorithm versus the exhaustive Pacific Symposium on Biocomputing 12:88-99(2007) September 24, 2006 23:36 Proceedings Trim Size: 9in x 6in sridhar Table 2. Metabolic networks from KEGG with identifier (Id). C, R and Ed denote the number of compounds, reactions and edges (interactions) respectively. Id N08 N13 N14 N17 N20 N22 N24 N28 N32 Metabolic Network Polyketide biosynthesis Xenobiotics biodegradation Citrate or TCA cycle Galactose Pentose phosphate Glycan Biosynthesis Glycerolipid Glycine, serine and threonine Pyruvate C 11 47 21 38 26 54 32 36 21 R 11 58 35 50 37 51 49 46 51 Ed 33 187 125 172 129 171 160 151 163 Id N42 N48 N52 N59 N71 N96 N170 N180 N537 Metabolic Network Other amino acid Lipid Purine Energy Nucleotide Vitamins and Cofactors Amino acid Carbohydrate Entire Network C 69 134 67 72 102 145 54 247 988 R 63 196 128 82 217 175 378 501 1790 Ed 208 654 404 268 684 550 1210 1659 5833 Table 3. Comparison of average damage values of solutions determined by the iterative algorithm versus the exhaustive search algorithm. Pathway Id Iterative Damage Exhaustive Damage N 14 2.51 2.51 N 17 8.73 8.73 N 20 1.63 1.63 N 24 3.39 3.17 N 28 1.47 1.47 N 32 0.59 0.59 10000 Average number of iterations N170 N180 N537 N08 N13 N14 N17 N20 N22 N24 N28 N32 N42 N48 N52 N59 N71 N96 Average execution time (ms) 10 9 8 7 6 5 4 3 2 1 0 N170 N180 N537 N08 N13 N14 N17 N20 N22 N24 N28 N32 N42 N48 N52 N59 N71 N96 1000 100 10 1 Pathway Identifier Pathway Identifier (a) (b) Figure 2. Evaluation of iterative algorithm. (a)Average execution time in milliseconds. (b)Average number of iterations search algorithm. We have shown the results only upto N 32, as the exhaustive search algorithm took longer than one day to finish even for N 32. We can see that the damage values of our method exactly match the damage values of the exhaustive search for all the networks except N 24. For N 24, the average damage differs from the exhaustive solution by only 0.02%. This shows that the iterative algorithm is a good approximation of the exhaustive search algorithm which computes an optimal solution. The slight deviation in damage is the tradeoff for achieving the scalability of the iterative algorithm (described next). Evaluation of Scalability: Figure 2(a) plots the average execution time of our it- Pacific Symposium on Biocomputing 12:88-99(2007) September 24, 2006 23:36 Proceedings Trim Size: 9in x 6in sridhar erative method for increasing sizes of metabolic networks. The running time increases slowly with the network size. As the number of enzymes increases from 8 to 537, the running time increases from roughly 1 to 10 seconds. The largest network, N 537, consists of 537 enzymes, and hence, an exhaustive evaluation inspects 2537 - 1 combinations (which is computationally infeasible). Thus, our results show that the iterative method scales well for networks of increasing sizes. This property makes our method an important tool for identifying the right enzyme combination for eliminating target compounds, especially for those networks for which an exhaustive search is not feasible. Figure 2(b) shows a plot of the average number of iterations for increasing sizes of metabolic networks. The iterative method reaches a steady state within 10 iterations in all cases. The various parameters (see Table 2) that influence the number of iterations are the number of enzymes, compounds, reactions and especially the number of interactions in the network (represented by edges in the network graph). Larger number of interactions increase the number of iterations considerably, as can be seen for networks N 22, N 48, N 96, N 537, where the number of iterations is greater than 5. This shows that, in addition to the number of enzymes, the number of compounds and reactions in the network and their interactions also play a significant role in determining the number of iterations. Our results show that the iterative algorithm can reliably reach a steady state and terminate, for networks as large as the entire metabolic network of E.Coli. 6. Conclusion Efficient computational strategies are needed to identify the enzymes (i.e., drug targets), whose inhibition will achieve the required effect of eliminating a given target set of compounds while incurring minimal side-effects. An exhaustive evaluation of all possible enzyme combinations to find the optimal subset is computationally infeasible for large metabolic networks. We proposed a scalable iterative algorithm which computes a sub-optimal solution to this problem within reasonable timebounds. Our algorithm is based on the intuition that we can arrive at a solution close to the optimal one by tracing backward from the target compounds. We evaluated the immediate precursors of a target compound and iteratively moved backwards, to identify the enzymes, whose inhibition stopped the production of the target compound while incurring minimum damage. We showed that our method converges within a finite number of such iterations. In our experiments on E.Coli metabolic network, the accuracy of a solution computed by the iterative algorithm deviated from that found by an an exhaustive search only by 0.02 %. Our iterative algorithm is highly scalable. It solved the problem for even the entire metabolic network of E.Coli in less than 10 seconds. References 1. `Proteome Mining' can zero in on Drug Targets. Duke University medical news, Aug 2004. 2. M Arita. The metabolic world of Escherichia coli is not small. PNAS, 101(6):1543­7, Pacific Symposium on Biocomputing 12:88-99(2007) September 24, 2006 23:36 Proceedings Trim Size: 9in x 6in sridhar 2004. 3. S. Broder and J. C. Venter. Sequencing the Entire Genomes of Free-Living Organisms: The Foundation of Pharmacology in the New Millennium. Annual Review of Pharmacology and Toxicology, 40:97­132, Apr 2000. 4. S. K. Chanda and J. S. Caldwell. Fulfilling the promise: Drug discovery in the postgenomic era. Drug Discovery Today, 8(4):168­174, Feb 2003. 5. A. Cornish-Bowden. Why is uncompetitive inhibition so rare? FEBS Letters, 203(1):3­ 6, Jul 1986. 6. A. Cornish-Bowden and J. S. Hofmeyr. The Role of Stoichiometric Analysis in Studies of Metabolism: An Example. Journal of Theoretical Biology, 216:179­191, May 2002. 7. J Drews. Drug Discovery: A Historical Perspective. Science, 287(5460):1960­1964, Mar 2000. 8. Davidov et. al. Advancing drug discovery through systems biology. Drug Discovery Today, 8(4):175­183, Feb 2003. 9. Deane et. al. Catechol-o-methyltransferase inhibitors versus active comparators for levodopa-induced complications in parkinson's disease. Cochrane Database of Systematic Reviews, 4, 2004. 10. Hatzimanikatis et. al. Metabolic networks: enzyme function and metabolite structure. Current Opinion in Structural Biology, (14):300­306, 2004. 11. Imielinski et. al. Investigating metabolite essentiality through genome scale analysis of E. coli production capabilities. Bioinformatics, Jan 2005. 12. Imoto et. al. Computational Strategy for Discovering Druggable Gene Networks from Genome-Wide RNA Expression Profiles. In PSB 2006 Online Proceedings, 2006. 13. Jeong et. al. Prediction of Protein Essentiality Based on Genomic Data. ComPlexUs, 1:19­28, 2003. 14. Lemke et. al. Essentiality and damage in metabolic networks. Bioinformatics, 20(1):115­119, Jan 2004. 15. Ma et. al. Decomposition of metabolic network into functional modules based on the global connectivity structure of reaction graph. Bioinformatics, 20(12):1870­6, 2004. 16. Mombach et. al. Bioinformatics analysis of mycoplasma metabolism: Important enzymes, metabolic similarities, and redundancy. Computers in Biology and Medicine, 2005. 17. Teichmann et. al. The Evolution and Structural Anatomy of the Small Molecule Metabolic Pathways in Escherichia coli. JMB, 311:693­708, 2001. 18. Yeh et. al. Computational Analysis of Plasmodium falciparum Metabolism: Organizing Genomic Information to Facilitate Drug Discovery. Genome Research, 14:917­924, 2004. 19. M Kanehisa and S Goto. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res., 28(1):27­30, Jan 2000. 20. C Smith. Hitting the target. Nature, 422:341­347, Mar 2003. 21. R. Somogyi and C.A. Sniegoski. Modeling the complexity of genetic networks: Understanding multi-gene and pleiotropic regulation. Complexity, 1:45­63, 1996. 22. P. Sridhar, T. Kahveci, and S. Ranka. Opmet: A metabolic network-based algorithm for optimal drug target identification. Technical report, CISE Department, University of Florida, Sep 2006. 23. R. Surtees and N. Blau. The neurochemistry of phenylketonuria. European Journal of Pediatrics, 159:109­13, 2000. 24. Takenaka T. Classical vs reverse pharmacology in drug discovery. BJU International, 88(2):7­10, Sep 2001.