Metagenomics

The phrase ‘metagenome of the soil’ was first used by Handelsman et al. (1998) to describe the collective genomes of soil microflora. In the area of microbial ecology, the term ‘metagenomics’ is now synonymous with the culture-independent application of genomics techniques to the study of microbial communities in their natural environments (Chen and Pachter 2005). Metagenomics arose in reaction to the observation that the majority of microorganisms on Earth resist life in captivity, i.e. they cannot be grown in broth or on plates in the laboratory. An often-cited estimate is that as much as 99% or more of microbial life remains unculturable, and therefore cannot be studied and understood in a way that microbial ecologists have become accustomed to over the past century. Metagenomics exploits the fact that while some microorganisms are culturable and others are not, all of them (i.e., 100%) are life-forms based on DNA as a carrier of genetic information. The metagenomic toolbox allows accessing, storing, and analyzing this DNA and thus can provide an otherwise hard-to-attain insight into the biology and evolution of environmental microorganisms, independent of their culturable status.

In modern metagenomics, three major and often overlapping directions can be recognized. The first is aimed at linking phylogeny to function. Once microbial ecologists got a satisfactory grasp on the issue of ‘who is out there?’ (Amann 2000), they set out to answer the question ‘who’s doing what out there?’. As one of many complementary methodologies, metagenomics can help answer that question in an indirect, culture-independent manner. One example is phylogenetic anchoring (Riesenfeld et al. 2004b), which involves screening large-insert environmental libraries for clones that carry phylogenetically informative genes (to reveal the ‘who’) and analyzing their flanking DNA for genes that reveal possible environmental functions of the DNA’s owner. A second trend in modern metagenomics involves its exploitation for the discovery of enzymes with novel and possibly industrially exploitable properties. This aspect of metagenomics was reviewed by Lorenz and Eck (2005), who conclude that metagenomics “provides industry with an unprecedented chance to bring biomolecules into industrial application”. The third and most recent trend in metagenomics is the mass sequencing of environmental samples. The promise of this approach is to offer a more global or systems-biology view of the community under study. Indeed, in several instances mass sequencing has led to more complete assessments of genetic diversity and to first insights into the interactions that occur in microbial communities (DeLong et al. 2006; Edwards et al. 2006; Gill et al. 2006; Hallam et al. 2004; Schmeisser et al. 2003; Tringe et al. 2005; Tyson et al. 2004; Venter et al. 2004).

Early pioneers in the field of metagenomics were Schmidt et al. (1991), who studied the phylogenetic diversity of an oligotrophic marine picoplankton community in the north central Pacific Ocean. Their original protocol involved (1) isolation of bulk genomic DNA from picoplankton collected by tangential flow filtration, (2) fragmentation, size-fractionation (10–20 kb) and cloning of the mixed-population DNA into bacteriophage lambda, (3) screening of the resulting library of recombinant phages by hybridization with 16 S rDNA probes, (4) sequencing of PCR-amplified 16 S rDNA from positive clones, and (5) comparison of the DNA sequences from unique clones with database entries to reveal some of the uncultured diversity of picoplanktonic life in the Pacific Ocean. This series of successive steps (DNA isolation, cloning, library screening, sequencing of interesting clones, and DNA comparison) is in essence the classical metagenomic strategy as defined by Handelsman et al. (1998). This basic theme of metagenomics is also depicted schematically in Fig. 1. This figure will be used as a framework to describe the metagenomic methodology and illustrate the many variations that have evolved over the past few years. For good overviews on the subject of metagenomics, several recent reviews are available (Daniel 2005; Deutschbauer et al. 2006; Green and Keller 2006; Handelsman 2004; Kowalchuk et al. 2007; Schloss and Handelsman 2005; Schmeisser et al. 2007; Streit and Schmitz 2004; Tringe and Rubin 2005; Ward 2006; Xu 2006). The present article presents an overview of metagenomics that is to some extent biased towards the discussion in the final section about how to capitalize on metagenomics as a tool in the study of PGPR.

Fig. 1

Schematic representation of the classical metagenomic protocol (1–5) and variations on a theme (6–10). Each of the following steps is discussed in more detail in the text: (1) Isolation of metagenomic DNA; (2) Cloning of metagenomic DNA; (3) Metagenomic library screening; (4) DNA sequencing; (5) Sequence analysis; (6) Environmental shotgun sequencing; (7) Enrichment of a particular subpopulation; (8) Enrichment strategies at the DNA level; (9) Direct all-or-none selection for clones of interest; (10) Isolation of metagenomic RNA

Isolation of metagenomic DNA

The first and obviously most important step in any metagenomic approach is the isolation of DNA from the environment under study. Looking at the many protocols that have been published to date (for a fairly broad overview, see individual chapters in Kowalchuk et al. 2004), it becomes apparent that no single protocol is suitable for the extraction of DNA from all environments. Key issues to consider at this stage are the quantity, purity, integrity, and representativeness of the DNA after isolation. DNA can be extracted from microorganisms by lysis either directly in the environmental sample, or indirectly, i.e. after separation and concentration of the microbial cells from their environmental matrix. The latter is often inevitable or recommended for many environments but this may be for different reasons. Isolation of DNA directly from ocean water is not practical given the low microbial density, so some filtration step is usually performed to first concentrate cells (Fuhrman et al. 1988). Many soil types are notorious for the presence of contaminants such as polyphenolics that co-purify with DNA and can inhibit subsequent steps in the metagenomic process (Tebbe and Vahjen 1993). To prevent this, several groups have developed ways to first separate cells from the soil matrix, e.g. by application to a Nycodenz cushion (Lindahl and Bakken 1995). Others preferentially extract DNA directly from soil, mainly for reasons of increased DNA yield and lesser bias (see below). In these cases, a separate DNA purification step is often included to minimize contamination with unwanted soil substances (Zhou et al. 1996). For direct extraction/purification of environmental DNA from soil, many commercial kits are now also available, for example the SoilMaster™ DNA Extraction Kit (Epicentre, Madison, WI) and PowerMax™ Soil DNA Isolation Kit (Mo Bio Laboratories, Carlsbad, CA).

There is a growing recognition that different DNA extraction methods can yield different results in terms of microbial representation in the final DNA sample (Kozdroj and van Elsas 2000). For example, indirect DNA extraction from marine sediments resulted in a lower observed microbial diversity at the DNA level than direct lysis methods did (Luna et al. 2006). As noted elsewhere (Tringe and Rubin 2005), the harsh lysis methods that are necessary to extract DNA from one organism will cause degradation of the DNA from another organism. A practical implication of this is that more than one type of extraction method should be used in the construction of a metagenomic library. Some DNA extraction protocols require a minimum amount of starting material, which can be a disadvantage for the isolation of metagenomic DNA from environments that permit only small sample sizes. A related issue is the efficiency of DNA extraction. Chemical extraction is usually less effective than procedures that involve rigorous bead-beating (Miller et al. 1999). If DNA yield is a problem, for example when it is less than the 0.5- to 4-μg minimum needed to construct a library for shotgun sequencing (Abulencia et al. 2006), one could consider the use of recently developed protocols for the amplification of whole genomes from environmental samples. One example is the use of Φ29 DNA polymerase, which by a process of multiple displacement amplification can provide enough DNA from only five bacterial cells for subsequent analyses (Abulencia et al. 2006).

A final factor to consider is integrity of the isolated DNA which dictates how it can be used in the subsequent steps of the metagenomic protocol. Bead-beating-type procedures tend to fragment DNA and are therefore mostly used for the construction of small-insert libraries, e.g. for shotgun sequencing. In contrast, gentle chemical lysis of cells recovered from soil in agar plugs can produce very large clonable DNA with fragment sizes exceeding 1 Mbp (Berry et al. 2003). The latter is more interesting when the objective is to screen metagenomic libraries for phylogenetic anchors or for particular phenotypes (see “Metagenomic library screening”).

Cloning of metagenomic DNA

After extraction and purification of metagenomic DNA, it is usually fragmented to the desired size (either enzymatically or by physical force) and size-separated, if necessary, e.g. on agarose gels (Liles et al. 2004). To carry the metagenomic DNA, a variety of vectors is available, including plasmids, bacteriophages, cosmids, fosmids, and bacterial artificial chromosomes (BACs). A common way to describe these vectors is by the size of DNA fragments that they can typically accommodate, i.e. 0.5–2, 7–10, 35–40, and 80–120 kb, respectively (Xu 2006). However, by this definition some vectors would qualify both as fosmids and as BACs. For example, Epicentre’s pCC1FOS and pCC1BAC are virtually identical but are sold as fosmid and BAC, respectively. The difference in nomenclature lies not in the nature of the vector, but in the method that is used to introduce vector-ligated metagenomic DNA into the host strain, in this case E. coli. With pCC1FOS, this is mediated through lambda packaging, while pCC1BAC is transformed by electroporation. A limitation of the packaging procedure is that there is an upward size-selection against DNA inserts larger than ∼40 kb. On the other hand, DNA inserts smaller than ∼35 kb are also rejected, so that a typical fosmid library contains very few false-negatives, i.e. clones that lack an insert or contain very small fragments. With BACs, large DNA fragments can be cloned (Rondon et al. 2000), even as large as 1 Mbp (Berry et al. 2003), but it is also more common for a BAC library to be quite variable in insert size, with a strong selection for small insert sizes and only a relatively small subset of clones carrying very large fragments (Rondon et al. 2000).

Insert size is a critical matter in any metagenomic approach. For environmental shotgun sequencing (see “Environmental shotgun sequencing”), insert sizes of 1.5 to 3 kb are preferred, whereas larger inserts are needed if one wants to maximize on the phylogenetic anchoring approach or if one’s objective is to clone large operons e.g. those involved in the production of certain antibiotics (Handelsman et al. 1998). Especially in cases such as the latter, where metagenomic DNA of interest is only detected if it is expressed, the choice of host is crucial to success. Escherichia coli is the standard in most metagenomic applications, mainly because of ease of manipulation, but it has limited and biased ability in expressing heterologous genes. Gabor et al. (2004a) provided estimates that only 7% of coding sequences from representatives of the class Actinobacteria would actually be expressed in an independent manner in an E. coli background, compared to 73% of the genes from Firmicutes origin. Indeed, the use of alternative hosts, such as Streptomyces (Courtois et al. 2003; Wang et al. 2000), Pseudomonas (Martinez et al. 2004), or Rhizobium leguminosarum (Li et al. 2005) can reveal gene activities that would not have been picked up in an E. coli library.

A final consideration in the construction of a metagenomic DNA library is the number of clones that is needed to achieve the desired outcome. It has been estimated that 10⁶ BAC clones with 100-kb inserts are required to represent the genomes of all the different prokaryotic species present in 1 g of soil (Handelsman et al. 1998), assuming that all species are equally abundant. If this is not the case, >10¹¹ BAC clones, or 10 Tbp of DNA, would be required to achieve adequate genomic representation of rare species (Daniel 2005). Hence, cloning the entire microbial metagenome of an environment as complex as soil is not feasible. To illustrate a more modest goal, suppose one sets out to find bacterial antibiotic resistance genes in a metagenomic library. If one such gene occurs in every 100 bacterial genomes in the metagenome (with an average genome size of 5 Mbp), it would take at least 57,500 40-kb inserts to find a single such gene with a 99% probability. This relatively low success rate has not only driven the interest in and application of automated handling systems (‘picking robots’) to construct and screen metagenomic libraries. It has also led to the exploration and development of variations on the classic metagenomic strategy in order to specifically increase the chances of finding clones of interest (see “Enrichment of a particular subpopulation,” “Enrichment strategies at the DNA level,” “Direct all-or-none selection for clones of interest,” and “Isolation of metagenomic RNA,” and the corresponding arrows in Fig. 1).
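The figure of roughly 57,500 clones is consistent with the standard library-size estimate N = ln(1 − P) / ln(1 − f), where P is the desired probability and f is the fraction of the metagenome carried by one insert. The text does not state which formula was used, so the following Python sketch rests on that assumption; the function and parameter names are illustrative only.

```python
import math

def clones_needed(insert_bp, genome_bp, genomes_per_target, probability):
    """Estimate how many clones must be screened to find at least one
    clone carrying the target with the given probability.

    Assumes N = ln(1 - P) / ln(1 - f); this formula is an assumption,
    it is not named in the text.
    """
    # Fraction of the relevant DNA pool carried by a single insert:
    # the target occurs once per `genomes_per_target` genomes.
    f = insert_bp / (genome_bp * genomes_per_target)
    return math.ceil(math.log(1.0 - probability) / math.log(1.0 - f))

# Values taken from the worked example in the text.
n = clones_needed(insert_bp=40_000, genome_bp=5_000_000,
                  genomes_per_target=100, probability=0.99)
print(n)  # on the order of 57,500 clones
```

Because f is very small, N scales almost linearly with the rarity of the target: a gene present in one of every 1,000 genomes would require about ten times as many clones.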

Metagenomic library screening

Two types of screening strategies can be distinguished. Those that are sequence-based capitalize on pre-existing DNA information to find clones in a metagenomic library that carry inserts with sequence similarity to a gene or locus of interest. This is achieved, for example, by colony or plaque hybridization with probes (Knietsch et al. 2003a; Schmidt et al. 1991) or by PCR using a specific primer set (e.g. Courtois et al. 2003). Most common targets of these searches are the genes for subunits of the ribosome (e.g. Liles et al. 2003; Quaiser et al. 2002), most frequently the small subunit or 16 S rRNA gene. These genes serve as phylogenetic anchors to link the identity of a DNA’s owner to part of its biology and evolution by analysis of the DNA sequences flanking the rRNA genes. Special mention is also warranted for the screening method LIL-FISH, for large-insert library fluorescent in situ hybridization (Leveau et al. 2004). In essence, it is an activity screening that utilizes FISH to identify clones in a metagenomic library which heterologously express ribosomal RNA genes. Its compatibility with FACS makes it an attractive method for high-throughput screening of libraries for phylogenetic anchors. A final screening strategy that still needs to be fully explored for its application in metagenomics but worthy of mention here is magnetic capture-hybridization (Jacobsen 1995). This technique involves the use of magnetic beads which are coated with a single-stranded DNA probe complementary to a target gene of interest and which can be used to identify fosmids of interest in a library (Hackl et al., unpublished data).

The second type of library screening is based on the expression of genes of interest through detection of their associated phenotypes in a clone library. The importance of a suitable host strain in this context has already been discussed. Preferably, the gene’s phenotype is readily detectable and often involves an enzymatic activity on a plate or in a microtiter plate. Recently, several novel screening methods have been described. Two of these, SIGEX and METREX, are based on the use of green fluorescent protein as a reporter of gene activity. SIGEX stands for substrate-induced gene expression and was developed by Uchiyama et al. (2005). It allows a high-throughput screen for catabolic genes by exploitation of the fact that many such genes, e.g. those that code for the degradation of aromatic compounds are commonly induced by the substrate they are targeted to degrade. Hence, by coupling metagenomic DNA to a promoterless gfp gene, library clones carrying genes of interest can be readily identified as green fluorescent cells or colonies in the presence of the inducing substrate. The power of SIGEX lies in the fact that it is compatible with fluorescence-activated cell sorting, which makes library screening a matter of seconds or minutes. METREX (Williamson et al. 2005) is a similar strategy that helps identify functionally active library clones based on an intracellular screen for quorum-sensing inducers.

DNA sequencing

Once clones of interest are identified by a functional or sequence-based screening, the next step usually involves the elucidation of the inserts’ partial or complete DNA sequence. Typically, large-insert fosmid clones are subjected to shotgun or transposon sequencing. To get a typical 10× coverage of a 45-kb fosmid (37-kb insert and 8-kb vector), 600 sequence reads of ∼750 bp each are needed.
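The read count follows from dividing the total number of bases to be sequenced (target length times coverage) by the read length. A minimal Python sketch of that arithmetic, with a hypothetical helper name:

```python
import math

def reads_for_coverage(target_bp, coverage, read_bp):
    """Number of reads needed to reach the given average coverage."""
    return math.ceil(target_bp * coverage / read_bp)

# 45-kb fosmid (37-kb insert plus 8-kb vector), 10x coverage, ~750-bp reads.
reads = reads_for_coverage(target_bp=45_000, coverage=10, read_bp=750)
print(reads)  # 600
```

This is an average: actual coverage varies along the clone, so some regions will be covered more deeply than others.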

Most modern sequencing takes place according to the original Sanger method (Sanger et al. 1977) of chain-termination by dideoxynucleotides and subsequent separation of differently sized DNA fragments by capillary gel electrophoresis. This method has proven to be very compatible with the high-throughput demand of many (meta)genomic projects. Recently, novel types of sequencing methods have been developed that are faster and have an even greater capacity. For example, pyrosequencing now allows the sequencing of 25 million bases in one 4-h run with an accuracy of 99.96% (Margulies et al. 2005). While this technology will not likely replace traditional sequencing, it has great utility for the environmental shotgun sequencing strategies discussed later (see “Environmental shotgun sequencing”). Pyrosequencing has some serious limitations, the main ones being the generation of relatively short (∼100-bp) reads and the poor ability of most current DNA analysis programmes to assemble such short reads. Nevertheless, it is an exciting new development that will undoubtedly revolutionize the field of microbial (meta)genomics.

Sequence analysis

There are many ways to analyze metagenomic DNA sequences (Chen and Pachter 2005). Gene-finding is supported by software applications such as GeneDB (Meyer et al. 2003), Artemis (Rutherford et al. 2000), Glimmer (Delcher et al. 1999), and FGenesB pipeline (www.softberry.com). These programmes use special algorithms to identify coding sequences, as well as other features such as promoters, terminators, operons, tRNA and rRNA. They can also provide a functional prediction of each identified putative gene based on sequence similarity of its predicted product. The programme MetaGene (Noguchi et al. 2006) is a prokaryotic gene-finding application that was designed specifically for the analysis of metagenomic datasets with many unassembled reads. Such datasets are typical for many types of environmental shotgun sequence projects (see “Environmental shotgun sequencing”). A crude approach to gene-finding is a BlastX-type of analysis, by which the DNA query sequence is first translated into protein sequences in all six reading frames, after which these products are compared against the existing protein databases. This approach depends heavily on the availability of similar sequences in the database and novel genes might be missed in this way.
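The six-frame translation step that precedes a BlastX-type comparison can be sketched in a few lines of Python. This is only an illustration of the translation itself, assuming the standard genetic code; it does not perform the database search, and the function names are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(dna):
    """Translate one reading frame; unknown codons (e.g. with N) become X."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)

def six_frame_translation(dna):
    """Return the three forward and three reverse-strand translations."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames

frames = six_frame_translation("ATGGCCTAA")
print(frames["+1"])  # MA*
print(frames["-1"])  # LGH
```

Each of the six protein strings would then be compared against a protein database, which is why the approach can only find genes that resemble sequences already deposited there.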

A particular problem with small-sized metagenomic DNA fragments is that they often lack phylogenetic markers in the shape of e.g. a 16 S rRNA gene. In those cases, parameters such as G + C content, BlastX scores, and codon usage frequencies can also be used as indicators of phylogenetic origin (Chen and Pachter 2005). Analysis of oligonucleotide frequencies is another promising approach to this problem, since these frequencies tend to exhibit species-specific patterns (Abe et al. 2005). TETRA (Teeling et al. 2004) is available as a web-service (www.megx.net/tetra) or stand-alone application that automates the task of comparing tetranucleotide frequencies. It computes correlation coefficients between patterns of tetranucleotide usage in DNA and works best with sequences of about 40 kb. Most recently, TETRA was used to group metagenomic DNA sequences from the marine worm Olavius algarvensis into four clusters representing four prokaryotic symbionts (Woyke et al. 2006).
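As a rough illustration of composition-based comparison, the Python sketch below computes G + C content and tetranucleotide frequencies for two sequences and correlates the frequency profiles. It is a simplified stand-in and not a reimplementation of TETRA: it uses raw single-strand frequencies with a plain Pearson coefficient, and whatever normalization TETRA applies is not reproduced here. All function names are hypothetical.

```python
from itertools import product
from math import sqrt

TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]  # 256 words

def gc_content(dna):
    """Fraction of unambiguous bases that are G or C."""
    dna = dna.upper()
    acgt = sum(dna.count(base) for base in "ACGT")
    return (dna.count("G") + dna.count("C")) / acgt if acgt else 0.0

def tetra_frequencies(dna):
    """Relative frequency of each tetranucleotide (overlapping windows)."""
    dna = dna.upper()
    counts = dict.fromkeys(TETRAMERS, 0)
    for i in range(len(dna) - 3):
        word = dna[i:i + 4]
        if word in counts:  # skip windows with ambiguous bases such as N
            counts[word] += 1
    total = sum(counts.values()) or 1
    return [counts[t] / total for t in TETRAMERS]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

# Toy sequences for demonstration only; real fragments should be much longer.
seq_a = "ACGTACGTTGCAACGTACGT"
seq_b = "GGGCCCGGGCCCGGGCCCGG"
r = pearson(tetra_frequencies(seq_a), tetra_frequencies(seq_b))
print(gc_content(seq_a), gc_content(seq_b), r)
```

With fragments far shorter than the ~40 kb mentioned above, most of the 256 tetranucleotides are sampled too sparsely for the correlation to be informative.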

Environmental shotgun sequencing

As already alluded to in the introduction, a recent development in metagenomics is environmental shotgun sequencing, also known as whole-genome shotgun sequencing (Chen and Pachter 2005). It usually involves the construction and end-sequencing of small-insert libraries from DNA directly isolated from the environment under study, although several variations exist. For example, the dataset of DeLong et al. (2006) consisted of end-sequences from a large-insert fosmid library, and Edwards et al. (2006) applied pyrosequencing, which makes library construction redundant altogether. Examples of environments from which prokaryotic communities have been shotgun-sequenced are the Sargasso Sea (Venter et al. 2004), the North Pacific Ocean from the surface to near-sea-floor depths (DeLong et al. 2006), sunken whale carcasses (Tringe et al. 2005), deep-sea sediment (Hallam et al. 2004), an acid mine biofilm (Tyson et al. 2004), groundwater in banded iron formations of a subsurface mine (Edwards et al. 2006), the human distal gut (Gill et al. 2006), drinking-water networks (Schmeisser et al. 2003), and an agricultural soil (Tringe et al. 2005).

One characteristic that all environmental shotgun sequencing projects share is the incredible amount of data that is generated (Schloss and Handelsman 2005). The Sargasso Sea project alone produced 2 million sequence reads, or 1,600 Mbp of DNA sequence (Venter et al. 2004). This explosion in environmental sequence data led several authors (Handelsman 2005; DeLong 2004) to refer to metagenomics as ‘megagenomics’ or ‘mega-metagenomics,’ respectively. This ‘genomics on a massive scale’ poses considerable challenges for data assembly and annotation. Successful assembly of individual reads into contigs is inversely related to the complexity of the prokaryotic community from which the DNA originated. The acid mine biofilm studied by Tyson et al. (2004) represented a relatively simple community, with only three bacterial and three archaeal lineages. From two members of the biofilm, Ferroplasma type II and Leptospirillum group II, near-complete genome sequences could be obtained after assembly. In more complex environments, this is typically not the case. For example, 50% of the reads from the Sargasso Sea could not be assembled, and for the agricultural soil, with an estimated species richness of >3,000 (Tringe et al. 2005), the percentage of unassembled reads even approached 100%. Interestingly, Tringe et al. (2005) recently questioned the need for assembly in such cases altogether and introduced the concept of environmental gene tags (EGTs), which takes a more gene-centric approach to the analysis of environmental sequencing. EGTs are in essence annotated individual reads from a metagenome shotgun project. Predicted genes on these EGTs are derived from individual members of the microbial community under study, and could potentially code for a habitat-specific adaptation. By comparison of EGT ‘fingerprints’ of different environments, Tringe et al. (2005) indeed observed emerging patterns of habitat-specific gene functions, for example, an overabundance of rhodopsin-like proteins in ocean surface waters, enzymes for the degradation of plant material in soil, and sodium transport proteins in marine environments. Basically, this approach represents an in silico version of subtractive hybridization (Schloss and Handelsman 2005) and presents one of several examples (Rodriguez-Brito et al. 2006) of how to tackle the challenges of comparative analysis of metagenomic datasets.

Enrichment of a particular subpopulation

This section is the first of four that describe some of the variations that have evolved from the basic metagenomic theme (Fig. 1) with the specific goal to increase the chance of finding a gene or function of interest. One of those variations is the subject of this section and involves community fractionation, i.e. the isolation of DNA only from a selected subpopulation of the environmental sample under study. One such type of selection is size. Using filters with different pore sizes, bacteriophage communities can be selectively enriched for metagenomic analyses (Edwards and Rohwer 2005). Similarly, size fractionation was used to select for consortia > 3 μm from deep-sea sediments (Hallam et al. 2004). Other strategies for enrichment of subpopulations include affinity purification or differential lysis (Tringe and Rubin 2005).

A very effective type of selection at the community level is based on the application of selective pressure to the environmental sample under study in order to favour the growth of bacteria expressing a desired activity. In an early example of this strategy, Healy et al. (1995) readily recovered clones expressing cellulase and xylosidase activities from libraries of DNA isolated from a mix of thermophilic, anaerobic digesters that had been in continuous operation with lignocellulosic feedstocks for >10 years. In another case (Gabor et al. 2004b), DNA was isolated from enrichment cultures in which amides (either singly or as a mixture of aromatic and non-aromatic forms) were supplied as the sole nitrogen source, to obtain a library enriched for amidases with different substrate specificities. Similarly, Entcheva et al. (2001) enriched for biotin-producing bacteria to isolate new biotin biosynthesis operons, whereas Knietsch et al. (2003b) pre-selected for utilization of glycerol and 1,2-propanediol to metagenomically identify genes encoding alcohol oxidoreductases.

Another clever selection procedure allows the isolation of DNA only from the live fraction of cells in a community. Nocker and Camper (2006) reported that propidium monoazide (PMA), like propidium iodide, is highly selective in penetrating only ‘dead’ bacterial cells where it intercalates in the DNA and can be covalently cross-linked by exposure to bright light. This process renders the DNA insoluble and results in its loss during subsequent genomic DNA extraction. Subjecting a bacterial population comprised of both live and dead cells to PMA treatment would thus result in selective removal of DNA from dead cells, which can be useful if one is interested only in analyzing the currently active, living fraction of a microbial community.

Enrichment strategies at the DNA level

A second kind of selection for genes or gene functions of interest involves selection at the level of metagenomic DNA after it has been isolated from the environment under study. The simplest application of this type of enrichment is the use of metagenomic DNA as a template in PCR (with degenerate primers, if needed) to amplify and clone genes of interest (Marchesi and Weightman 2003). A method described by Nesbø et al. (2005) allows for the enrichment only of metagenomic DNA that carries phylogenetic markers. Central in this approach is the restriction enzyme I-CeuI, which targets a 19-bp sequence that is conserved in 23 S rRNA of most bacteria. After isolation, fragmentation and end-repair, metagenomic DNA is digested with I-CeuI and ligated to vector pCC1.FOS.CeuI.23S, a derivative of pCC1FOS containing unique I-CeuI and blunt sites. To illustrate the effectiveness of this approach, fosmid libraries were constructed from anaerobic sediments sampled from Baltimore harbour, and found to be enriched by an approximate factor of 80 for clones carrying 23 S rRNA genes. An attractive side advantage of the method is that end-sequencing of clones in this library, which is relatively cheap, can provide instant information on the identity of the DNA’s origin based on 23 S similarity.

Stable isotope probing is a technique by which to isolate from the metagenome pool specifically that DNA which is derived from organisms that can metabolize a particular substrate (Friedrich 2006). This is achieved by incorporation of 13C- or 15N-labelled substrates into biomass, including DNA, of the active subpopulation in the community under study. The labelled DNA (from the active microorganisms) is then separated from unlabelled DNA through density gradient centrifugation and used as starting material for the cloning into vectors. One example is the report by Dumont et al. (2006) which describes the application of 13CH4 to forest soil to generate a BAC library enriched for methane monooxygenase genes. Schwarz et al. (2006) used 13C-labelled glycerol on a sediment sample of the Wadden Sea to enrich by 2.1- to 3.8-fold DNA fragments that carried genes encoding coenzyme B12-dependent glycerol dehydratases.

Galbraith et al. (2004) have described a novel method based on suppressive subtractive hybridization to specifically isolate from one metagenomic DNA sample (designated the ‘tester’) fragments that are absent in another metagenomic DNA sample (the ‘driver’). Applied to the study of microbial populations in the rumen, this approach revealed an unexpectedly large difference in archaeal community structure between steers fed identical diets. It should be noted that there are several drawbacks to this method: for example, large numbers of subtractive sequences must be read to achieve an accurate sense of the degree of genetic diversity between two samples. However, the technique is very suitable for comparative purposes, is able to zoom in on relatively small differences, and may make it possible to link those differences in genetic diversity to physical, chemical or biological dissimilarities between environmental samples.

Direct all-or-none selection for clones of interest

A very powerful instrument in the metagenomic toolbox is the use of conditional survival of the metagenomic host through functional complementation. In essence, one exploits the fact that the host that is being used needs a particular type of gene or gene cluster from the pool of metagenomic DNA fragments in order to survive an imposed condition. This concept has been applied by several groups. For example, Li et al. (2005) screened a metagenomic library for clones that could correct tryptophan auxotrophy in hosts E. coli and R. leguminosarum, and in doing so identified several different trp operons. Gabor et al. (2004b) were successful in finding metagenomic clones that complemented the leucine auxotrophy of the host strain E. coli TOP10 by expression of amidase activity on a medium containing phenylacetyl-l-leucine or d-phenylglycine-l-leucine as the sole source of leucine. Similarly, Entcheva et al. (2001) used a biotin-auxotroph E. coli strain to pick up several clones with biotin biosynthesis operons.

The major advantage of this approach of complementation is its all-or-none character: only if a gene is present and expressed will the host survive, so that false-positives can be expected to be rare. False-negatives, on the other hand, may be more frequent, depending on the host or range of hosts that is being used. Li et al. (2005) identified at least one set of trp genes that complemented R. leguminosarum but not E. coli, while Entcheva et al. (2001) noted that all of the biotin operons that were recovered with E. coli as a host had highest similarity to similar operons in Enterobacteriaceae. Thus, host choice greatly determines the success rate of this strategy in finding genes and gene functions of interest.

Isolation of metagenomic RNA

A final variation on the classical metagenomic theme starts with the isolation of RNA, not DNA, from the environment under study, followed by reverse transcription of this RNA, and cloning of the resulting cDNA. This approach can offer answers not to the question ‘who is out there?’ but to ‘who is active out there?’. It is a technically challenging approach, mostly because of RNA instability. Working protocols are available for the isolation of environmental prokaryotic RNA (Hurt et al. 2001) and cloning of short cDNA sequences (Poretsky et al. 2005). Grant et al. (2006) recently described procedures for stabilizing eukaryotic RNA in environmental samples in the field such that they can be transported back to the laboratory, the RNA isolated, and cDNA libraries made for subsequent sequencing and expression studies. Mills et al. (2004) were able to determine the composition of the metabolically active fraction of microbial communities in marine sediments by rRNA extraction and reverse transcription to obtain clonable complementary 16S ribosomal DNA.

Metagenomic stories of success and words of caution

The present excitement about metagenomics has its roots in a number of clear success stories (Table 1). One is the discovery of a member of a new class of rhodopsins, encoded on a BAC-insert from an uncultivated marine α-Proteobacterium (Béjà et al. 2000b). Proteorhodopsin is a retinal-binding bacterial integral membrane protein that acts as a light-driven proton pump in the surrogate host E. coli (Béjà et al. 2001). Subsequent studies have revealed the abundance and diversity of this type of protein in ocean waters (Venter et al. 2004). This discovery clearly illustrates the utility of metagenomics to reveal unsuspected biological functions. Another story of success is the Sargasso Sea project (Venter et al. 2004). It demonstrated, unambiguously, the muscle of metagenomics: by sheer brute force, 1 Gbp of nonredundant sequence was generated, roughly one-third the size of the human genome (Venter et al. 2001). From <2 cubic meters of ocean water, an unprecedented microbial diversity was uncovered, including 148 previously unknown bacterial phylotypes and 1.2 million previously unknown genes. The Sargasso Sea study greatly stimulated the discussion on the limits and the future of metagenomics and still serves as a reference point for many of the environmental shotgun sequencing projects that followed in its footsteps. A third tale of metagenomic fame is that of the acid mine drainage biofilm (Tyson et al. 2004). It makes several interesting points, most of which are a direct consequence of the low complexity of the community under study. Because of this low complexity, the genomes of two members in the consortium could be reconstructed to near completion, enabling a true systems-biology approach to studying the interplay of microbial metabolism and mineral dissolution (Allen and Banfield 2005).
A proteomic analysis of this community showed that biofilm polymer production and nitrogen fixation appeared to be partitioned among community members and that an abundant cytochrome-like protein might be essential to the production of acid mine drainage (Ram et al. 2005). The metagenome sequence also revealed single-nucleotide polymorphisms which prompted addressing the biofilm community in terms of population genomics and evolutionary ecology (Whitaker and Banfield 2006). Furthermore, based on clues from its metagenomic sequence, one of the previously unculturable consortium members, Leptospirillum ferrodiazotrophum, could be grown in the laboratory (Tyson et al. 2005), reducing the number of ‘unculturable’ bacteria by one.

Table 1 A selection of success stories and milestones in the field of metagenomics

Many of the stories of metagenomic success also contain important words of caution. In hindsight and in the words of DeLong (Sreenivasan 2001), the discovery of the proteorhodopsin gene was to a large degree a matter of serendipity, or luck. Since it was one of the very first successful cases to demonstrate the ability of metagenomics to link phylogeny and function, the bar was put high for any future endeavour in this direction. In practice, while the chance of finding novel genes on any metagenomic fosmid or BAC insert is not very low, it is much less probable to find a gene coding for a never-seen-before prokaryotic life-style such as the capture of light for energy. Unrealistic expectations may also arise from a false sense of the probability of finding a particular gene of interest in a metagenome library. Even without considering potential problems of expressing heterologous DNA in a surrogate host or insufficient homology to identify clones using PCR or hybridization, it remains a laborious and often underestimated task to screen through many thousands of clones, especially in the case of metagenomic DNA from environments with high microbial complexity. For example, it took 1,186,200 clones containing a collective 5.4 Gbp of soil DNA to identify nine unique clones that conferred resistance to aminoglycoside antibiotics and one clone expressing resistance to tetracycline (Riesenfeld et al. 2004a). A hit rate of one interesting gene per 5–5,000 Mbp of cloned metagenomic DNA is not uncommon for activity-based screenings (Lorenz and Eck 2005) and should probably be considered a normal operating range in any metagenomic screening.
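To put the figures quoted above for Riesenfeld et al. (2004a) on the same scale as the 5–5,000 Mbp range, the arithmetic can be written out as a short sketch. Only the three input numbers come from the text; the function and variable names are assumptions made for illustration.

```python
# Figures as quoted in the text for Riesenfeld et al. (2004a).
CLONES_SCREENED = 1_186_200
CLONED_DNA_BP = 5.4e9          # 5.4 Gbp of soil DNA
HITS = 9 + 1                   # aminoglycoside clones + tetracycline clone


def mbp_per_hit(total_bp, n_hits):
    """Cloned DNA (in Mbp) that had to be screened per positive clone."""
    return total_bp / n_hits / 1e6


def clones_per_hit(n_clones, n_hits):
    """Number of clones screened per positive clone."""
    return n_clones / n_hits


def mean_insert_bp(total_bp, n_clones):
    """Average insert size implied by the library totals."""
    return total_bp / n_clones


rate = mbp_per_hit(CLONED_DNA_BP, HITS)
print(f"{rate:.0f} Mbp screened per hit")
print(f"{clones_per_hit(CLONES_SCREENED, HITS):.0f} clones screened per hit")
print(f"{mean_insert_bp(CLONED_DNA_BP, CLONES_SCREENED):.0f} bp mean insert size")
```

With these inputs the screen works out to about 540 Mbp, or roughly 118,600 clones, per positive clone, which falls inside the range reported by Lorenz and Eck (2005).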

A final consideration in this section is the recent argument (Oremland et al. 2005) that the predictive and interpretative value of metagenomics is limited by the validity of database entries. Many genes are listed with unknown function and for many genes that have a function assigned based on sequence similarity, this function still needs to be validated experimentally. In this context, it is of extreme value to have culturable representatives available from the environment under study. The Sargasso Sea study and other massive-scale shotgun projects clearly show the rapid progress in sequencing capacity, but do not contribute greatly to our ability to assign gene function. Metagenomics is more than a descriptive science; it should be appreciated as a methodology that is complementary to conventional approaches in testing hypotheses on the composition, diversity and functionality of microbial communities.

Metagenomics and the study of plant growth-promoting rhizobacteria

It was mentioned earlier in the introduction that the first and most important step in a metagenomic approach is the isolation of DNA from the environment under study. This is not quite right: it is probably more important to first ask whether it is warranted or wise to invest in a metagenomics approach at all and for what purpose. The remainder of this article will be an attempt to address this question in general terms for the study of plant growth-promoting rhizobacteria (PGPR).

The term PGPR (Kloepper and Schroth 1978) refers to those plant root (‘rhizosphere’)-associated bacteria that are capable of stimulating plant growth, e.g. by improving plant nutrition, by the production of plant growth regulators or by preventing the attack of pathogenic microorganisms. PGPR vary in their degree of intimacy with the plant, from intracellular, i.e. existing inside root cells, to extracellular, i.e. free-living in the rhizosphere (Gray and Smith 2005). Some PGPR are commercially available as inoculants and have applicability for example in agriculture, forest regeneration, and phytoremediation of soils (Lucy et al. 2004).

There is a clear potential for metagenomics to contribute to the study of microbial communities of the rhizosphere, in particular PGPR. Possible contributions include (1) the discovery of novel plant growth-promoting genes and gene products, and (2) the characterization of (not-yet-)culturable PGPRs. Before discussing these in more detail below, it is worth noting that in practical terms, the application of metagenomics to PGPR in the rhizosphere greatly benefits from previous advances in DNA isolation and library construction from other environments. Rhizosphere soil poses more or less the same challenges as bulk soil, for which several metagenomic success stories have been published (Daniel 2005). Probably the biggest obstacle in the construction of a metagenomic library from rhizosphere soil DNA is the relatively low availability of starting material. Typically, only 20 mg of soil adheres to 1 cm of root (Jacobsen 2004), so one needs (depending on the plant species under investigation) 50 to 500 cm of root material in order to apply a DNA extraction method that requires 1 to 10 g of soil. Several protocols have also been developed for the isolation of metagenomic bacterial DNA from inside plant material. For example, Jiao et al. (2006) describe an indirect method based on enzymatic hydrolysis of plant tissues to release associated microorganisms for subsequent DNA isolation and cloning. While optimized for leaves and seeds, this method seems readily adaptable for use with root material, and could thus be of great use to the metagenomic exploration of microorganisms in the rhizosphere.
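The root-length estimate follows directly from the 20 mg of adhering soil per cm of root cited above. A minimal sketch of that calculation is given here; the function name and default argument are illustrative assumptions, and the 20 mg figure will vary with plant species and soil type.

```python
SOIL_PER_CM_ROOT_MG = 20.0  # typical value quoted in the text (Jacobsen 2004)


def root_length_needed_cm(soil_required_g, soil_per_cm_mg=SOIL_PER_CM_ROOT_MG):
    """Root length (cm) needed to collect a given mass of rhizosphere soil."""
    return soil_required_g * 1000.0 / soil_per_cm_mg


# DNA extraction methods requiring 1 g and 10 g of soil, as in the text.
shortest = root_length_needed_cm(1)
longest = root_length_needed_cm(10)
print(shortest, longest)
```

Passing a different value for `soil_per_cm_mg` shows how quickly the required root length grows for plants that retain less soil per unit of root.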

Novel plant growth-promoting genes and gene products

For many of the traits or mechanisms known to be plant growth-promoting (Bloemberg and Lugtenberg 2001), in vitro activity assays have been described and are, at least in theory, exploitable for gain-of-function screenings of a metagenomic library from rhizosphere DNA. For example, antibiotic activity towards (plant-pathogenic) bacteria or fungi can be assessed by testing whole-cell library clones or their extracts in a medium- or high-throughput manner for performance in confrontation assays. Many such assays have been described using a variety of indicator strains, including some of the most important soil-borne pathogens, e.g. the bacteria Erwinia (Emmert et al. 2004) and Xanthomonas (Rangarajan et al. 2003), the fungi Fusarium (Chin-A-Woeng et al. 1998; Kim et al. 2006) and Rhizoctonia (Kim et al. 2006; Rangarajan et al. 2003), and the fungus-like oomycetes Phytophthora (Kim et al. 2006) and Pythium (Kim et al. 2006; Rajendran et al. 1998). Production of the plant hormone indole 3-acetic acid (IAA) by metagenomic library clones can be measured using high-pressure liquid chromatography or colorimetric assays (Bric et al. 1991; Omer et al. 2004; Radwan et al. 2002; Leveau and Lindow 2005), while cytokinins and their metabolites are detectable in supernatants by e.g. immunoaffinity chromatography (Timmusk et al. 1999). Genes for nitrogen fixation are retrievable with the use of nitrogen-free media (Ding et al. 2005; Hashidoko et al. 2002; Tejera et al. 2005). Similarly, genes for the utilization of particular rhizosphere exudates could be recovered using an all-or-none complementation selection for growth on minimal medium containing these exudates as sole source of energy, carbon and/or nitrogen. In a similar approach, clones expressing 1-aminocyclopropane 1-carboxylate (ACC) deaminase, a plant growth-promoting enzyme that lowers plant ethylene levels (Glick et al. 1998), could be selected for by using ACC as sole source of nitrogen, as described previously (Holguin and Glick 2003; Shaharoona et al. 2006). The activities of lytic enzymes are most easily identified through clear zones around colonies on solid media, as has been documented e.g. for several biocontrol chitinases (Basha and Ulaganathan 2002; Gohel et al. 2004; Kobayashi et al. 2002; Leveau et al. 2006). Assays based on halo formation are also available to identify PGPR-related phenotypes such as solubilization of mineral phosphate (Rodriguez et al. 2000) and siderophore production (Lee et al. 2003).

For the functional screening of library clones for PGPR functions, the use of alternative hosts seems very promising and rational, for several reasons. First, there is an abundant availability of phylogenetically diverse culturable PGPRs (Vessey 2003; Lucy et al. 2004) which could improve the probability of finding genes of interest, especially those that are not expressed in E. coli and whose full activity requires a specific PGPR background. The host role could also be played by several of the numerous defined mutants of PGPR that carry a knockout in one or several genes contributing to a particular PGPR phenotype. Such mutants could be useful to screen libraries for heterologous genes and gene functions by a functional complementation approach. Proofs of principle for such an approach are available in the literature, e.g. single-gene complementation of a mutant of Burkholderia sp. strain PsJN in quinolinate phosphoribosyltransferase (QAPRTase) activity (Wang et al. 2006), a mutant of Pseudomonas putida WCS358 unable to produce the siderophore pseudobactin 358 (Devescovi et al. 2001), and a mutant of Pseudomonas chlororaphis PCL1391 impaired in the production of the antifungal secondary metabolite phenazine-1-carboxamide (PCN; Girard et al. 2006). Especially interesting in this respect is the use of mutants that lack one or more genes in a multi-gene pathway for the production of antibiotics by enzymes such as polyketide synthases (Staunton and Weissman 2001) and non-ribosomal peptide synthases (Raaijmakers et al. 2006). The modularity underlying such proteins allows for a strategy of combinatorial complementation (Coeffet-Le Gal et al. 2006), possibly leading to the discovery of antimicrobials with new structures and new target specificities (Wenzel and Müller 2005).

Activity screenings such as the ones described above have the potential to retrieve never-before-seen genes with PGPR activity from the metagenomic pool. In contrast, the metagenomic harvest from sequence-based approaches such as PCR and/or Southern hybridizations will inevitably uncover only genes that match the specificity of the primers and/or probes that were used to find them. Nevertheless, screening rhizosphere DNA for PGPR-related genes by PCR (e.g. Juraeva et al. 2006; Sato et al. 1997) or Southern hybridization (e.g. Blaha et al. 2005; Shah et al. 1998) has several advantages in a metagenomic setting. Most importantly, sequencing the flanking regions of such genes on large-insert fosmid or BAC clones could provide insight into the identity of their owner, into the genetic context of these PGPR genes and possibly into the mechanisms of their regulation.

Highly complementary to activity- and sequence-based screenings, a third approach to finding novel plant growth-promoting genes and gene functions is through comparative metagenomics. For one, the rhizosphere can be viewed as an environment that in comparison to the bulk soil is enriched in particular types of microorganisms, including PGPR. Indeed, there is ample evidence that the microbial diversity as measured by phylogenetic markers such as ribosomal RNA genes can differ dramatically between bulk and rhizosphere soil (Costa et al. 2006a; Sanguin et al. 2006). EGT fingerprinting by shotgun sequencing (Tringe and Rubin 2005) or suppressive subtractive hybridization (Galbraith et al. 2004) of bulk and rhizosphere soil compartments could reveal differences in the type of gene adaptations that each compartment selects for. It is expected that genes with PGPR-like functions would be enriched in the rhizosphere library. Similarly, comparison of the genomic diversity of disease-suppressive and non-suppressive soils (Weller et al. 2002) could expose genetic factors that contribute to or are predictive of the suppressiveness towards e.g. pathogenic microorganisms or nematodes.
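One simple way to express such an enrichment, once sequence reads from each compartment have been assigned to gene families, is a log2 ratio of relative abundances. The sketch below is only an illustration of that idea: the function name, the pseudocount and all counts are hypothetical, and a real comparison would also need replication and a statistical test.

```python
import math


def log2_enrichment(rhizo_counts, bulk_counts, pseudocount=1.0):
    """Log2 ratio of relative abundance (rhizosphere / bulk) per gene family.

    A pseudocount is added to every family in both libraries so that
    families absent from one library do not cause a division by zero.
    """
    families = set(rhizo_counts) | set(bulk_counts)
    rhizo_total = sum(rhizo_counts.values()) + pseudocount * len(families)
    bulk_total = sum(bulk_counts.values()) + pseudocount * len(families)
    scores = {}
    for family in families:
        rhizo_freq = (rhizo_counts.get(family, 0) + pseudocount) / rhizo_total
        bulk_freq = (bulk_counts.get(family, 0) + pseudocount) / bulk_total
        scores[family] = math.log2(rhizo_freq / bulk_freq)
    return scores


# Hypothetical read counts per gene family; not real data.
rhizo = {"ACC deaminase": 30, "nitrogenase": 12, "housekeeping": 958}
bulk = {"ACC deaminase": 5, "nitrogenase": 10, "housekeeping": 985}

scores = log2_enrichment(rhizo, bulk)
for family, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{family}: {score:+.2f}")
```

Positive scores indicate families that are relatively more abundant in the rhizosphere library, negative scores those more abundant in bulk soil.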

Characterization of (not-yet-)culturable PGPRs

Due to a historical bias to study those microorganisms that can be grown in the laboratory, there is limited knowledge on the abundance and activity of not-yet-culturable PGPR. However, there are several examples of their existence and contribution to plant health, e.g. Pasteuria penetrans, a not-yet-culturable bacterium parasitic to plant-pathogenic nematodes (Fould et al. 2001), the nitrogen-fixing activity of viable-but-not-culturable Azoarcus grass endophytes (Hurek et al. 2002), and the obligate biotrophism of arbuscular mycorrhizal (AM) fungi (Millner and Wright 2002). Bacteria belonging to the Acidobacteria and Verrucomicrobia are in many rhizospheres among the most abundant, difficult-to-culture representatives (Buckley and Schmidt 2003; Gremion et al. 2003). However, it is not clear if and how their abundance is correlated to their contribution towards plant health. A phylogenetic anchoring approach, e.g. using the previously described I-CeuI method (Nesbø et al. 2005) to construct a library over-represented in DNA fragments harbouring 23S rRNA genes, in combination with a PCR-based screening of this library with Acidobacterial or Verrucomicrobial primers would allow a (partial) insight into the genomes of these bacteria beyond the limited dataset that currently exists for these classes of bacteria (Liles et al. 2003; Quaiser et al. 2003; Wagner and Horn 2006) and into their possibly beneficial effect on plant growth. Major progress is being made in the development of new cultivation techniques (Joseph et al. 2003; Kaeberlein et al. 2002; Stevenson et al. 2004; Zengler et al. 2002), which offers the prospect that many more of the formulated hypotheses based on metagenomic analysis of the rhizosphere and its PGPR constituency (see below) will become testable as the number of culturable representatives from the rhizosphere steadily increases.

An analysis of the rhizosphere by comparative metagenomics holds the promise to answer several important questions regarding the unculturable fraction of the rhizosphere community. For one, it could expose what actually constitutes this fraction from a comparison of metagenomic DNA isolated directly from rhizosphere to DNA isolated from all the colonies forming on solid media after plating from that same rhizosphere (i.e. the culturable fraction). One could expect a phylogenetic analysis of these two libraries to show differences, based on previous observations (e.g. Sliwinski and Goodman 2004; Costa et al. 2006b), and with large-scale DNA sequencing of both libraries, a start could be made to contrast the genetic diversity of the two populations. Furthermore, by comparison of the functions enriched for in a library from rhizosphere soil versus one from bulk soil, the degree of the selection in each of the compartments for particular microbial activities, specifically those with PGPR relevance, can be estimated. A shotgun sequencing approach for unlocking the unculturable diversity of rhizosphere bacteria, including PGPR, has not yet been reported. However, a recently published study (Erkel et al. 2006) describes the use of metagenomic sequencing to reconstruct the 3.18-Mbp genome of rice cluster I (RC-I) Archaea originating from the rice rhizosphere. DNA for the shotgun library was isolated from a methanogenic enrichment culture using rice paddy soil as an inoculum. While RC-I Archaea do not necessarily qualify as PGPR, the study shows that shotgun sequencing in combination with a prior enrichment strategy towards an originally complex rhizosphere population allows the metagenomic analysis of rhizobacteria with a particular function of interest.

Conclusions

In summary, the tools of metagenomics offer many openings into a broadened view of the rhizosphere in general and of PGPR and their activities in particular. Several existing assays for PGPR activity have been listed here and proposed to have immediate utility for the screening of large-insert DNA libraries for gain-of-function phenotypes. The discovery of novel PGPR activities, either by functional screening or based on DNA sequence information, will add enormously to our understanding of the mechanistic variation that exists in PGPR phenotypes. It will also benefit our ability to improve existing PGPR, by adding to the pool of exploitable PGPR genes and by using this pool to develop PGPRs with enhanced performance (Downing and Thomson 2000; Glick and Bashan 1997; Holguin and Glick 2003; Timms-Wilson et al. 2004). The use of metagenomics in parallel with established or novel molecular approaches to the study of PGPR, such as genome sequencing of new PGPR isolates (Jeong et al. 2006) and transcriptional profiling of PGPR (Mark et al. 2005; Wang et al. 2005), will undoubtedly lead to the discovery of novel mechanisms of PGPR activity, new types of PGPR identity and a fresh look at the biology and practical application of PGPR.