Metabolic Pathfinding Using RPAIR Annotation

https://doi.org/10.1016/j.jmb.2009.03.006Get rights and content

Abstract

Metabolic databases contain information about thousands of small molecules and reactions, which can be represented as networks. In the context of metabolic reconstruction, pathways can be inferred by searching optimal paths in such networks. A recurrent problem is the presence of pool metabolites (e.g., water, energy carriers, and cofactors), which are connected to hundreds of reactions, thus establishing irrelevant shortcuts between nodes of the network. One solution to this problem relies on weighted networks to penalize highly connected compounds. A more refined solution takes the chemical structure of reactants into account in order to differentiate between side and main compounds of a reaction. Thanks to an intensive annotation effort at KEGG, decompositions of reactions into reactant pairs (RPAIR) categorized by their role (main, trans, cofac, ligase, and leave) are now available.

The goal of this article is to evaluate the impact of RPAIR data on pathfinding in metabolic networks. To this end, we measure the impact of different parameters concerning the construction of the metabolic network: mapping of reactions and reactant pairs onto a graph, use of selected categories of reactant pairs, weighting schemes for compounds and reactions, removal of highly connected metabolites, and reaction directionality. In total, we tested 104 combinations of parameters and identified their optimal values for pathfinding on the basis of 55 reference pathways from three organisms.

The best-performing metabolic network combines the biochemical knowledge encoded by KEGG RPAIR with a weighting scheme penalizing highly connected compounds. With this network, we could recover reference pathways from Escherichia coli with an average accuracy of 93% (32 pathways), from Saccharomyces cerevisiae with an average accuracy of 66% (11 pathways), and from humans with an average accuracy of 70% (12 pathways). Our pathfinding approach is available as part of the Network Analysis Tools.

Introduction

In biochemical textbooks1 and in databases such as KEGG2, 3 or BioCyc,4, 5 metabolic information is represented in the form of generic or organism-specific metabolic maps or pathways. Historically, the characterization of metabolic pathways relied on a very few model organisms (Escherichia coli, Salmonella, Saccharomyces cerevisiae, mouse, human, and so on). Enzyme grouping into pathways was defined on the basis of mutant phenotypes (e.g., methionine auxotrophy) and, with progress in biochemistry, detailed successions of reactions could be established. Such biochemical analyses revealed that different organisms can use alternative pathways to produce or utilize the same molecules. For example, in E. coli, l-lysine is produced in nine reactions from l-aspartate, whereas in S. cerevisiae, it is produced in eight reactions from 2-oxoglutarate. The documented examples of alternative pathways probably reflect a tiny part of the huge diversity of biochemical pathways.

The large size of the biochemical pathway space is also apparent from calculations performed by Kueffner et al., who constructed a PETRI net from several metabolic databases and enumerated all paths between glucose and pyruvate.6 Without further constraints, they obtained around 500,000 paths. Even after application of a number of constraints, as many as 170 pathways connecting glucose to pyruvate remained.

This demonstrates that detecting de novo pathways by exhaustive enumeration of paths will return many false positives, whereas relying on previous knowledge will overlook a large number of valid pathways. Automated metabolic pathfinding approaches can assist biochemists by proposing a reasonable number of hypothetic pathways ranked by potential relevance, which may then be filtered and modified on the basis of biochemical knowledge and validated experimentally.

Metabolic pathfinding can be applied to predict pathways between enzymes encoded by genes that are assumed to be functionally related (on the basis of coexpression, operon organization, phylogenetic profiles, gene fusion, and so on). This is especially useful in metabolic reconstruction, whose goal is to decipher organism-specific metabolism from genome data. Reconstruction strategies can be classified according to their extent of automation: In the absence of automation, data obtained from metabolic literature, expert knowledge, or databases are manually assembled into a metabolic network.4, 7, 8 Once a set of pathways is known, the reconstruction procedure can be automated, using known pathways as template. The reconstruction procedure can be entirely automated9, 10, 11, 12, 13, 14 or can rely on a computer-based assignment refined by manual annotation as in “Tier 2” BioCyc databases.12 Building a network from known pathways alone suffers from a serious limitation: The reconstructed network is restricted to previously characterized pathways. This does not allow the discovery of new alternatives. This limitation becomes apparent when reconstructed pathways contain gaps (i.e., reactions not catalyzed by any enzyme annotated in the query genome). A gap may occur because (1) the pathway is absent from the organism of interest and reactions identified for it belong to other pathways; (2) the enzyme-coding gene has not yet been identified in the query genome; (3) the organism uses a variant of the pathway that bypasses the gap (reasons 1–3 are listed in Paley and Karp15); or (4) the reaction is spontaneous in this organism. Metabolic pathfinding helps to address this problem: On one hand, it can suggest alternatives to the pathway in question and, on the other hand, it can be applied to identify and rank candidate gap filler enzymes.16, 17

Metabolic pathfinding relies on a metabolic network (equivalently called “metabolic graph”) where compounds and reactions are nodes linked by edges representing substrate/product relationships.6, 18, 19 Various algorithms can be used to find one (shortest pathfinding) or several (k-shortest pathfinding) paths between a given pair of start and end nodes. Unfortunately, the shortest paths found in a raw metabolic network generally do not correspond to relevant biochemical pathways.11, 20, 21 Indeed, a major problem of metabolic pathfinding is the presence of compounds involved in a high number of reactions. Typically, these compounds are cofactors (e.g., NADP+/NADPH and NAD+/NADH), small inorganic molecules (e.g., H2O, O2, and CO2), or energy carriers (e.g., ATP and ADP). The shortest pathfinding algorithms will use such highly connected compounds as shortcuts and will thus infer irrelevant pathways containing cofactors or energy carriers as intermediates (Fig. 1a).

Failure to deal with highly connected compounds will bias any topological analysis of metabolic networks that is based on pathfinding. For instance, Jeong et al. described a “small-world” property of metabolic networks, which states that each compound in the network can be reached from other compounds in a small number of steps.18 However, the “small-world” property was questioned by several authors because most of these short paths result from shortcuts linking two reactions via some irrelevant compounds.11, 20, 21, 22

A number of strategies have been devised to overcome the problem of highly connected compounds.

The concept of “pool compounds” has been introduced. Compounds in the pool are freely available to a reaction in the pathway, whereas other compounds have to be produced or consumed by the pathway. This concept is especially applied in flux balance analysis where pool compounds are called external metabolites, which do not need to be balanced (e.g., Schuster et al.,23 Teusink et al.,24 and Edwards and Palsson25).

Karp and Paley defined “main compounds” as those shared between subsequent reaction steps of a pathway: They form the “backbone” of that pathway.26 In contrast, “side compounds” are not involved in subsequent reaction steps.

Various authors16, 19, 22 have excluded a selected list of compounds from the metabolic network under study. However, since a good definition of pool compounds is missing, it is unclear which compounds are to be removed. In addition, removal of those compounds generally avoids irrelevant shortcuts, but occasionally prevents the finding of some pathways in which they are used as intermediates (e.g., ADP is a side compound in most pathways, but a main compound for the de novo biosynthesis of purine nucleotides).

Another strategy is to take into account the chemical structure of the compounds in order to trace, for each reaction, the transfer of atom groups between substrates and products.20, 27, 28 This assumes a reaction-specific definition of “main” and “side” substrates/products. Arita's concept defines “a pathway from X to Y [as] a sequence of biochemical reactions through which at least one carbon in X reaches Y.” This strategy of atom mapping and tracing introduced by Arita was applied in modified forms for pathfinding.29, 30, 31, 32 It requires knowledge of the structure of each compound and a tedious annotation of atom transfers within each reaction.

Pool compounds can also be avoided by defining a set of rules,33 assigned by experts on the basis of compound structure and biochemical literature, in order to enumerate all conversions allowed for each compound. The drawback of this approach is that the rule set might be incomplete or too restrictive to allow discovery of new pathways.

Kotera et al. introduced the concept of reactant pairs (RPAIRs), defined as “pairs of compounds that have atoms or atom groups in common on two sides of a reaction.”34, 35 RPAIR thus establishes pairwise relationships between one substrate and one product, taking into account the respective roles of these compounds in the reaction. Accordingly, reactant pairs are grouped into the following classes: main, trans, cofac, ligase, and leave. For instance, a cofac reactant pair couples a substrate cofactor with a product cofactor (i.e., A00002 pairs NADH with NAD+). Figure 1 shows typical examples of decomposition of reactions (Fig. 1a) into reactant pairs (Fig. 1b). This figure also demonstrates that biochemically irrelevant connections between ATP/ADP and other compounds are avoided in the RPAIR graph (Fig. 1c and d) simply because corresponding reactant pairs are absent in the RPAIR database. Note that the same reactant pair may take on different roles in different reactions (i.e., the ATP/orthophosphate reactant pair belonging to all five classes).

Oh et al. combined RPAIRs and structural comparisons to predict biodegradation pathways by iteratively matching a compound with its most similar partner in a library of reactant pairs.36

In a previous article, we introduced a strategy for predicting metabolic pathways based on k-shortest pathfinding in a weighted graph.37 By default, each compound is assigned a weight equal to its degree (“connectivity”) in the metabolic graph. The weight is then used as a penalty for pathfinding, leading to the concept of lightest pathfinding (as opposed to shortest). Thus, the more highly connected is a compound, the less likely it is to appear in an inferred pathway.

An important issue is the protocol used to assess the reliability of a pathway discovery method. Indeed, several approaches were tested on only a few study cases (e.g., McShan et al.,29 Rahman et al.,31 and de Figueiredo et al.38), or conclusions were drawn from global topological parameters without any attempt to compare discovered paths to annotated pathways (e.g., Jeong et al.18). In contrast, our strategy to penalize compounds according to their degree was supported by an evaluation on several tens of pathways.21

With the availability of the KEGG RPAIR database, we can now integrate manually compiled knowledge on atom flow as stored in the reactant pairs into pathfinding. In this article, we quantify the impact of KEGG RPAIR on pathfinding accuracy by carrying out a careful validation of known pathways from three organisms (E. coli, S. cerevisiae, and Homo sapiens). In addition, we evaluate the effects of parameters linked to the RPAIR database [e.g., the filtering and weighting of reactant pairs according to their class (main, trans, cofac, ligase, or leave)]. In order to compare results obtained with KEGG RPAIR to those obtained from our previous method, we also measure the impact of different weighting schemes and the filtering of pool metabolites on pathfinding accuracy.

We added the new version of the metabolic pathfinding tool and the best-performing metabolic networks to the Network Analysis Tools Web server.39

Section snippets

Graph construction and evaluation

We summarize here the main features of the metabolic graphs and pathways used for evaluation. The detailed description can be found in Materials and Methods.

We constructed three bipartite metabolic graphs from KEGG/LIGAND. The first one, called reaction graph, is built from all compounds and reactions, in the same way as in our previous work.21 The second graph, named RPAIR graph, consists of all reactant pairs listed in KEGG RPAIR and their associated compounds. The third graph, called

Conclusion

In previous studies,21, 37 we evaluated the inference of metabolic pathways by pathfinding in three alternative graph types (raw, compound-filtered, and compound-weighted) and showed that pathfinding accuracy is strongly improved by weighting compounds according to their degree in the metabolic graph. We now extended this analysis by comparing 104 combinations of parameters and by quantifying the impact of each parameter on the accuracy of the inferred pathways. In particular, we assessed the

Network construction

We constructed three different networks from KEGG/LIGAND (release 41.0).

The first graph, named reaction graph, was built from all reactions and compounds present in KEGG/LIGAND. The resulting graph is composed of 6359 reactions and 5312 compounds, which are connected by 26,786 edges.

The second graph, termed RPAIR graph, was constructed from all reactant pairs (7058) in KEGG RPAIR (release 41.0) and all compounds involved in at least one of these reactant pairs (4297). It has fewer compounds

Acknowledgements

We would like to thank Jean-Noël Monette and Pierre Schaus for making available useful modifications to the REA algorithm. They also implemented a Python wrapper around REA that inspired the Java wrapper employed for pathfinding evaluation.

We acknowledge the aMAZE team for their support during the initial stages of this work. We used their Java graph library and, in addition, one of them (Olivier Hubaut) suggested the graph transformation that extended pathfinding with two nodes to pathfinding

References (54)

  • KarpP.D. et al.

    EcoCyc: an encyclopedia of Escherichia coli genes and metabolism

    Nucleic Acids Res.

    (1996)
  • CaspiR. et al.

    The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases

    Nucleic Acids Res.

    (2008)
  • KueffnerR. et al.

    Pathway analysis in metabolic databases via differential metabolic display

    Bioinformatics

    (2000)
  • FörsterJ. et al.

    Genome-scale reconstruction of the Saccharomyces cerevisiae metabolic network

    Genome Res.

    (2003)
  • DuarteN.C. et al.

    Global reconstruction of the human metabolic network based on genomic and bibliomic data

    Proc. Natl Aced. Sci.

    (2007)
  • GaasterlandT. et al.

    Reconstruction of metabolic networks using incomplete information

    Proc. Int. Conf. Intell. Syst. Mol. Biol.

    (1995)
  • MaH. et al.

    Reconstruction of metabolic networks from genome data and analysis of their global structure for various organisms

    Bioinformatics

    (2003)
  • KarpP.D. et al.

    Expansion of the BioCyc collection of pathway/genome databases to 160 genomes

    Nucleic Acids Res.

    (2005)
  • MoriyaY. et al.

    KAAS: an automatic genome annotation and pathway reconstruction server

    Nucleic Acids Res.

    (2007)
  • DeJonghM. et al.

    Toward the automated generation of genome-scale metabolic networks in the SEED

    BMC Bioinf.

    (2007)
  • PaleyS.M. et al.

    Evaluation of computational metabolic-pathway predictions for Helicobacter pylori

    Bioinformatics

    (2002)
  • KharchenkoP. et al.

    Filling gaps in a metabolic network using expression information

    Bioinformatics

    (2004)
  • KharchenkoP. et al.

    Identifying metabolic enzymes with multiple types of association evidence

    BMC Bioinf.

    (2006)
  • JeongH. et al.

    The large-scale organization of metabolic networks

    Nature

    (2000)
  • FellD.A. et al.

    The small world of metabolism

    Metab. Eng.

    (2000)
  • AritaM.

    The metabolic world of Escherichia coli is not small

    Proc. Natl Acad. Sci.

    (2004)
  • van HeldenJ. et al.

    Graph-based analysis of metabolic networks

    Ernst Schering Res. Found. Workshop

    (2002)
  • Cited by (50)

    • Retrosynthetic design of metabolic pathways to chemicals not found in nature

      2019, Current Opinion in Systems Biology
      Citation Excerpt :

      Different networks can be used, depending on the engineering problem being addressed. For example, if the goal is to identify alternate routes through metabolism to increase flux through genetic engineering, then only the network associated with the chassis organism needs to be searched (e.g., E. coli has 3755 metabolites) [197–203]. The goal could also be to define metabolic pathways from other organisms that could be moved into the target host [202,204–**206].

    View all citing articles on Scopus
    View full text