Metabolic Pathfinding Using RPAIR Annotation

doi:10.1016/j.jmb.2009.03.006

Journal of Molecular Biology

Volume 388, Issue 2, 1 May 2009, Pages 390-414

https://doi.org/10.1016/j.jmb.2009.03.006 Get rights and content

Abstract

Metabolic databases contain information about thousands of small molecules and reactions, which can be represented as networks. In the context of metabolic reconstruction, pathways can be inferred by searching optimal paths in such networks. A recurrent problem is the presence of pool metabolites (e.g., water, energy carriers, and cofactors), which are connected to hundreds of reactions, thus establishing irrelevant shortcuts between nodes of the network. One solution to this problem relies on weighted networks to penalize highly connected compounds. A more refined solution takes the chemical structure of reactants into account in order to differentiate between side and main compounds of a reaction. Thanks to an intensive annotation effort at KEGG, decompositions of reactions into reactant pairs (RPAIR) categorized by their role (main, trans, cofac, ligase, and leave) are now available.

The goal of this article is to evaluate the impact of RPAIR data on pathfinding in metabolic networks. To this end, we measure the impact of different parameters concerning the construction of the metabolic network: mapping of reactions and reactant pairs onto a graph, use of selected categories of reactant pairs, weighting schemes for compounds and reactions, removal of highly connected metabolites, and reaction directionality. In total, we tested 104 combinations of parameters and identified their optimal values for pathfinding on the basis of 55 reference pathways from three organisms.

The best-performing metabolic network combines the biochemical knowledge encoded by KEGG RPAIR with a weighting scheme penalizing highly connected compounds. With this network, we could recover reference pathways from Escherichia coli with an average accuracy of 93% (32 pathways), from Saccharomyces cerevisiae with an average accuracy of 66% (11 pathways), and from humans with an average accuracy of 70% (12 pathways). Our pathfinding approach is available as part of the Network Analysis Tools.

Introduction

In biochemical textbooks¹ and in databases such as KEGG2, 3 or BioCyc,4, 5 metabolic information is represented in the form of generic or organism-specific metabolic maps or pathways. Historically, the characterization of metabolic pathways relied on a very few model organisms (Escherichia coli, Salmonella, Saccharomyces cerevisiae, mouse, human, and so on). Enzyme grouping into pathways was defined on the basis of mutant phenotypes (e.g., methionine auxotrophy) and, with progress in biochemistry, detailed successions of reactions could be established. Such biochemical analyses revealed that different organisms can use alternative pathways to produce or utilize the same molecules. For example, in E. coli, l-lysine is produced in nine reactions from l-aspartate, whereas in S. cerevisiae, it is produced in eight reactions from 2-oxoglutarate. The documented examples of alternative pathways probably reflect a tiny part of the huge diversity of biochemical pathways.

The large size of the biochemical pathway space is also apparent from calculations performed by Kueffner et al., who constructed a PETRI net from several metabolic databases and enumerated all paths between glucose and pyruvate.⁶ Without further constraints, they obtained around 500,000 paths. Even after application of a number of constraints, as many as 170 pathways connecting glucose to pyruvate remained.

This demonstrates that detecting de novo pathways by exhaustive enumeration of paths will return many false positives, whereas relying on previous knowledge will overlook a large number of valid pathways. Automated metabolic pathfinding approaches can assist biochemists by proposing a reasonable number of hypothetic pathways ranked by potential relevance, which may then be filtered and modified on the basis of biochemical knowledge and validated experimentally.

Metabolic pathfinding can be applied to predict pathways between enzymes encoded by genes that are assumed to be functionally related (on the basis of coexpression, operon organization, phylogenetic profiles, gene fusion, and so on). This is especially useful in metabolic reconstruction, whose goal is to decipher organism-specific metabolism from genome data. Reconstruction strategies can be classified according to their extent of automation: In the absence of automation, data obtained from metabolic literature, expert knowledge, or databases are manually assembled into a metabolic network.4, 7, 8 Once a set of pathways is known, the reconstruction procedure can be automated, using known pathways as template. The reconstruction procedure can be entirely automated9, 10, 11, 12, 13, 14 or can rely on a computer-based assignment refined by manual annotation as in “Tier 2” BioCyc databases.¹² Building a network from known pathways alone suffers from a serious limitation: The reconstructed network is restricted to previously characterized pathways. This does not allow the discovery of new alternatives. This limitation becomes apparent when reconstructed pathways contain gaps (i.e., reactions not catalyzed by any enzyme annotated in the query genome). A gap may occur because (1) the pathway is absent from the organism of interest and reactions identified for it belong to other pathways; (2) the enzyme-coding gene has not yet been identified in the query genome; (3) the organism uses a variant of the pathway that bypasses the gap (reasons 1–3 are listed in Paley and Karp¹⁵); or (4) the reaction is spontaneous in this organism. Metabolic pathfinding helps to address this problem: On one hand, it can suggest alternatives to the pathway in question and, on the other hand, it can be applied to identify and rank candidate gap filler enzymes.16, 17

Metabolic pathfinding relies on a metabolic network (equivalently called “metabolic graph”) where compounds and reactions are nodes linked by edges representing substrate/product relationships.6, 18, 19 Various algorithms can be used to find one (shortest pathfinding) or several (k-shortest pathfinding) paths between a given pair of start and end nodes. Unfortunately, the shortest paths found in a raw metabolic network generally do not correspond to relevant biochemical pathways.11, 20, 21 Indeed, a major problem of metabolic pathfinding is the presence of compounds involved in a high number of reactions. Typically, these compounds are cofactors (e.g., NADP⁺/NADPH and NAD⁺/NADH), small inorganic molecules (e.g., H₂O, O₂, and CO₂), or energy carriers (e.g., ATP and ADP). The shortest pathfinding algorithms will use such highly connected compounds as shortcuts and will thus infer irrelevant pathways containing cofactors or energy carriers as intermediates (Fig. 1a).

Failure to deal with highly connected compounds will bias any topological analysis of metabolic networks that is based on pathfinding. For instance, Jeong et al. described a “small-world” property of metabolic networks, which states that each compound in the network can be reached from other compounds in a small number of steps.¹⁸ However, the “small-world” property was questioned by several authors because most of these short paths result from shortcuts linking two reactions via some irrelevant compounds.11, 20, 21, 22

A number of strategies have been devised to overcome the problem of highly connected compounds.

The concept of “pool compounds” has been introduced. Compounds in the pool are freely available to a reaction in the pathway, whereas other compounds have to be produced or consumed by the pathway. This concept is especially applied in flux balance analysis where pool compounds are called external metabolites, which do not need to be balanced (e.g., Schuster et al.,²³ Teusink et al.,²⁴ and Edwards and Palsson²⁵).

Karp and Paley defined “main compounds” as those shared between subsequent reaction steps of a pathway: They form the “backbone” of that pathway.²⁶ In contrast, “side compounds” are not involved in subsequent reaction steps.

Various authors16, 19, 22 have excluded a selected list of compounds from the metabolic network under study. However, since a good definition of pool compounds is missing, it is unclear which compounds are to be removed. In addition, removal of those compounds generally avoids irrelevant shortcuts, but occasionally prevents the finding of some pathways in which they are used as intermediates (e.g., ADP is a side compound in most pathways, but a main compound for the de novo biosynthesis of purine nucleotides).

Another strategy is to take into account the chemical structure of the compounds in order to trace, for each reaction, the transfer of atom groups between substrates and products.20, 27, 28 This assumes a reaction-specific definition of “main” and “side” substrates/products. Arita's concept defines “a pathway from X to Y [as] a sequence of biochemical reactions through which at least one carbon in X reaches Y.” This strategy of atom mapping and tracing introduced by Arita was applied in modified forms for pathfinding.29, 30, 31, 32 It requires knowledge of the structure of each compound and a tedious annotation of atom transfers within each reaction.

Pool compounds can also be avoided by defining a set of rules,³³ assigned by experts on the basis of compound structure and biochemical literature, in order to enumerate all conversions allowed for each compound. The drawback of this approach is that the rule set might be incomplete or too restrictive to allow discovery of new pathways.

Kotera et al. introduced the concept of reactant pairs (RPAIRs), defined as “pairs of compounds that have atoms or atom groups in common on two sides of a reaction.”34, 35 RPAIR thus establishes pairwise relationships between one substrate and one product, taking into account the respective roles of these compounds in the reaction. Accordingly, reactant pairs are grouped into the following classes: main, trans, cofac, ligase, and leave. For instance, a cofac reactant pair couples a substrate cofactor with a product cofactor (i.e., A00002 pairs NADH with NAD⁺). Figure 1 shows typical examples of decomposition of reactions (Fig. 1a) into reactant pairs (Fig. 1b). This figure also demonstrates that biochemically irrelevant connections between ATP/ADP and other compounds are avoided in the RPAIR graph (Fig. 1c and d) simply because corresponding reactant pairs are absent in the RPAIR database. Note that the same reactant pair may take on different roles in different reactions (i.e., the ATP/orthophosphate reactant pair belonging to all five classes).

Oh et al. combined RPAIRs and structural comparisons to predict biodegradation pathways by iteratively matching a compound with its most similar partner in a library of reactant pairs.³⁶

In a previous article, we introduced a strategy for predicting metabolic pathways based on k-shortest pathfinding in a weighted graph.³⁷ By default, each compound is assigned a weight equal to its degree (“connectivity”) in the metabolic graph. The weight is then used as a penalty for pathfinding, leading to the concept of lightest pathfinding (as opposed to shortest). Thus, the more highly connected is a compound, the less likely it is to appear in an inferred pathway.

An important issue is the protocol used to assess the reliability of a pathway discovery method. Indeed, several approaches were tested on only a few study cases (e.g., McShan et al.,²⁹ Rahman et al.,³¹ and de Figueiredo et al.³⁸), or conclusions were drawn from global topological parameters without any attempt to compare discovered paths to annotated pathways (e.g., Jeong et al.¹⁸). In contrast, our strategy to penalize compounds according to their degree was supported by an evaluation on several tens of pathways.²¹

With the availability of the KEGG RPAIR database, we can now integrate manually compiled knowledge on atom flow as stored in the reactant pairs into pathfinding. In this article, we quantify the impact of KEGG RPAIR on pathfinding accuracy by carrying out a careful validation of known pathways from three organisms (E. coli, S. cerevisiae, and Homo sapiens). In addition, we evaluate the effects of parameters linked to the RPAIR database [e.g., the filtering and weighting of reactant pairs according to their class (main, trans, cofac, ligase, or leave)]. In order to compare results obtained with KEGG RPAIR to those obtained from our previous method, we also measure the impact of different weighting schemes and the filtering of pool metabolites on pathfinding accuracy.

We added the new version of the metabolic pathfinding tool and the best-performing metabolic networks to the Network Analysis Tools Web server.³⁹

Section snippets

Graph construction and evaluation

We summarize here the main features of the metabolic graphs and pathways used for evaluation. The detailed description can be found in Materials and Methods.

We constructed three bipartite metabolic graphs from KEGG/LIGAND. The first one, called reaction graph, is built from all compounds and reactions, in the same way as in our previous work.²¹ The second graph, named RPAIR graph, consists of all reactant pairs listed in KEGG RPAIR and their associated compounds. The third graph, called

Conclusion

In previous studies,21, 37 we evaluated the inference of metabolic pathways by pathfinding in three alternative graph types (raw, compound-filtered, and compound-weighted) and showed that pathfinding accuracy is strongly improved by weighting compounds according to their degree in the metabolic graph. We now extended this analysis by comparing 104 combinations of parameters and by quantifying the impact of each parameter on the accuracy of the inferred pathways. In particular, we assessed the

Network construction

We constructed three different networks from KEGG/LIGAND (release 41.0).

The first graph, named reaction graph, was built from all reactions and compounds present in KEGG/LIGAND. The resulting graph is composed of 6359 reactions and 5312 compounds, which are connected by 26,786 edges.

The second graph, termed RPAIR graph, was constructed from all reactant pairs (7058) in KEGG RPAIR (release 41.0) and all compounds involved in at least one of these reactant pairs (4297). It has fewer compounds

Acknowledgements

We would like to thank Jean-Noël Monette and Pierre Schaus for making available useful modifications to the REA algorithm. They also implemented a Python wrapper around REA that inspired the Java wrapper employed for pathfinding evaluation.

We acknowledge the aMAZE team for their support during the initial stages of this work. We used their Java graph library and, in addition, one of them (Olivier Hubaut) suggested the graph transformation that extended pathfinding with two nodes to pathfinding

References (54)

M. Arita
Metabolic reconstruction using shortest paths
Simul. Pract. Theory
(2000)
CroesD. et al.
Inferring meaningful pathways in weighted metabolic networks
J. Mol. Biol.
(2006)
SchusterS. et al.
Detection of elementary flux modes in biochemical networks: a promising tool for pathway analysis and metabolic engineering
TIBTECH
(1999)
TeusinkB. et al.
Analysis of growth of Lactobacillus plantarum WCFS1 on a complex medium using a genome-scale metabolic model
J. Biol. Chem.
(2006)
PasqualiniJ.R.
Enzymes involved in the formation and transformation of steroid hormones in the fetal and placental compartments
J. Steroid Biochem. Mol. Biol.
(2005)
FélixL. et al.
Validation of metabolic pathway databases based on chemical substructure search
Biomol. Eng.
(2007)
DuinC.W. et al.
Solving group Steiner problems as Steiner problems
Eur. J. Oper. Res.
(2004)
BergJ.M. et al.
GotoS. et al.
Organizing and computing metabolic pathway data in terms of binary relations
Pac. Symp. Biocomput.
(1997)
KanehisaM. et al.
KEGG for linking genomes to life and the environment
Nucleic Acids Res.
(2008)

AritaM.

The metabolic world of Escherichia coli is not small

Proc. Natl Acad. Sci.

(2004)

van HeldenJ. et al.

Graph-based analysis of metabolic networks

Ernst Schering Res. Found. Workshop

(2002)

Cited by (50)

Gene-metabolite annotation with shortest reactional distance enhances metabolite genome-wide association studies results
2023, iScience
Metabolite genome-wide association studies (mGWAS) have advanced our understanding of the genetic control of metabolite levels. However, interpreting these associations remains challenging due to a lack of tools to annotate gene-metabolite pairs beyond the use of conservative statistical significance threshold. Here, we introduce the shortest reactional distance (SRD) metric, drawing from the comprehensive KEGG database, to enhance the biological interpretation of mGWAS results. We applied this approach to three independent mGWAS, including a case study on sickle cell disease patients. Our analysis reveals an enrichment of small SRD values in reported mGWAS pairs, with SRD values significantly correlating with mGWAS p values, even beyond the standard conservative thresholds. We demonstrate the utility of SRD annotation in identifying potential false negatives and inaccuracies within current metabolic pathway databases. Our findings highlight the SRD metric as an objective, quantitative and easy-to-compute annotation for gene-metabolite pairs, suitable to integrate statistical evidence to biological networks.
Automated engineering of synthetic metabolic pathways for efficient biomanufacturing
2021, Metabolic Engineering
Metabolic engineering involves the engineering and optimization of processes from single-cell to fermentation in order to increase production of valuable chemicals for health, food, energy, materials and others. A systems approach to metabolic engineering has gained traction in recent years thanks to advances in strain engineering, leading to an accelerated scaling from rapid prototyping to industrial production. Metabolic engineering is nowadays on track towards a truly manufacturing technology, with reduced times from conception to production enabled by automated protocols for DNA assembly of metabolic pathways in engineered producer strains. In this review, we discuss how the success of the metabolic engineering pipeline often relies on retrobiosynthetic protocols able to identify promising production routes and dynamic regulation strategies through automated biodesign algorithms, which are subsequently assembled as embedded integrated genetic circuits in the host strain. Those approaches are orchestrated by an experimental design strategy that provides optimal scheduling planning of the DNA assembly, rapid prototyping and, ultimately, brings forward an accelerated Design-Build-Test-Learn cycle and the overall optimization of the biomanufacturing process. Achieving such a vision will address the increasingly compelling demand in our society for delivering valuable biomolecules in an affordable, inclusive and sustainable bioeconomy.
Bioinformatics Approaches for Interpreting Metabolomics Datasets
2020, Comprehensive Foodomics
Metabolomics has become mature technology for biological investigations in fundamental metabolic research to large-scale epidemiological projects, yielding new biological mechanisms and risk factors for chronic diseases. Typical metabolomics dataset can have hundreds of peaks which can be analyzed using a range of bioinformatics approaches. New computational tools have emerged for generating comprehensive and deeper insights from today's metabolomics datasets and their association with clinical and other omics data.
Retrosynthetic design of metabolic pathways to chemicals not found in nature
2019, Current Opinion in Systems Biology
Citation Excerpt :
Different networks can be used, depending on the engineering problem being addressed. For example, if the goal is to identify alternate routes through metabolism to increase flux through genetic engineering, then only the network associated with the chassis organism needs to be searched (e.g., E. coli has 3755 metabolites) [197–203]. The goal could also be to define metabolic pathways from other organisms that could be moved into the target host [202,204–**206].
Biology produces a universe of chemicals whose precision and complexity is the envy of chemists. Over the last 30 years, the expansive field of metabolic engineering has many successes in optimizing the overproduction of metabolites of industrial interest, including moving natural product pathways to production hosts (e.g., plants to yeast). However, there are stunningly few examples where enzymes are artificially combined to make a chemical that is not found somewhere in nature. Here, we review these efforts and discuss the challenges limiting the construction of such pathways. An analogy is made to the retrosynthesis problem solved in chemistry using algorithmic approaches, recently harnessing artificial intelligence, noting key differences in the needs of the optimization problem. When these issues are addressed, we see a future where chemistry and biology are intertwined in reaction networks that draw on the power of both to build currently unobtainable molecules across consumer, industrial, and defense applications.
A review of computational tools for design and reconstruction of metabolic pathways
2017, Synthetic and Systems Biotechnology
Metabolic pathways reflect an organism's chemical repertoire and hence their elucidation and design have been a primary goal in metabolic engineering. Various computational methods have been developed to design novel metabolic pathways while taking into account several prerequisites such as pathway stoichiometry, thermodynamics, host compatibility, and enzyme availability. The choice of the method is often determined by the nature of the metabolites of interest and preferred host organism, along with computational complexity and availability of software tools. In this paper, we review different computational approaches used to design metabolic pathways based on the reaction network representation of the database (i.e., graph or stoichiometric matrix) and the search algorithm (i.e., graph search, flux balance analysis, or retrosynthetic search). We also put forth a systematic workflow that can be implemented in projects requiring pathway design and highlight current limitations and obstacles in computational pathway design.
Evolutionary algorithm for metabolic pathways synthesis
2016, BioSystems
Metabolic pathway building is an active field of research, necessary to understand and manipulate the metabolism of organisms. There are different approaches, mainly based on classical search methods, to find linear sequences of reactions linking two compounds. However, an important limitation of these methods is the exponential increase of search trees when a large number of compounds and reactions is considered. Besides, such models do not take into account all substrates for each reaction during the search, leading to solutions that lack biological feasibility in many cases. This work proposes a new evolutionary algorithm that allows searching not only linear, but also branched metabolic pathways, formed by feasible reactions that relate multiple compounds simultaneously. Tests performed using several sets of reactions show that this algorithm is able to find feasible linear and branched metabolic pathways.

View all citing articles on Scopus

View full text

Journal of Molecular Biology

Metabolic Pathfinding Using RPAIR Annotation

Abstract

Introduction

Section snippets

Graph construction and evaluation

Conclusion

Network construction

Acknowledgements

Simul. Pract. Theory

J. Mol. Biol.

TIBTECH

J. Biol. Chem.

J. Steroid Biochem. Mol. Biol.

Biomol. Eng.

Eur. J. Oper. Res.

Organizing and computing metabolic pathway data in terms of binary relations

Pac. Symp. Biocomput.

KEGG for linking genomes to life and the environment

Nucleic Acids Res.

EcoCyc: an encyclopedia of Escherichia coli genes and metabolism

Nucleic Acids Res.

The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases

Nucleic Acids Res.

Pathway analysis in metabolic databases via differential metabolic display

Bioinformatics

Genome-scale reconstruction of the Saccharomyces cerevisiae metabolic network

Genome Res.

Global reconstruction of the human metabolic network based on genomic and bibliomic data

Proc. Natl Aced. Sci.

Reconstruction of metabolic networks using incomplete information

Proc. Int. Conf. Intell. Syst. Mol. Biol.

Reconstruction of metabolic networks from genome data and analysis of their global structure for various organisms

Bioinformatics

Expansion of the BioCyc collection of pathway/genome databases to 160 genomes

Nucleic Acids Res.

KAAS: an automatic genome annotation and pathway reconstruction server

Nucleic Acids Res.

Toward the automated generation of genome-scale metabolic networks in the SEED

BMC Bioinf.

Evaluation of computational metabolic-pathway predictions for Helicobacter pylori

Bioinformatics

Filling gaps in a metabolic network using expression information

Bioinformatics

Identifying metabolic enzymes with multiple types of association evidence

BMC Bioinf.

The large-scale organization of metabolic networks

Nature

The small world of metabolism

Metab. Eng.

The metabolic world of Escherichia coli is not small

Proc. Natl Acad. Sci.

Graph-based analysis of metabolic networks

Ernst Schering Res. Found. Workshop