1 Introduction

Pangenomes of bacterial species show a tremendous range of diversity in size, content, and fluidity. When the size of the core genome is compared with that of the accessory genome, some species possess relatively limited pangenomes while others are expansive. Accessory genomes may be composed of genes belonging to phages, transposons, insertion sequences, and plasmids, as well as genes that have diverged through mutation and recombination to the point where they are considered separate homologs. Some of these genomic elements may be relatively stable (e.g., an integrated prophage), while others may be gained and lost within a single bacterial culture (e.g., plasmids). In this chapter, we discuss the population genomics of pangenomes and, more specifically, of accessory genomes, detailing how accessory genomes vary among and within bacterial species and the implications this variation has for microbial ecology. Throughout this discussion, it is important not to lose sight of what we are referring to with the catch-all phrase “accessory.” These are the dynamic elements of the genome, often containing large genomic islands that augment the bacterium’s phenotype, which may, as we will outline, be used to glean knowledge of the ecology and evolutionary history of a genus, species, or set of lineages. Further, in no way does the term accessory or the misleading synonym “dispensable” suggest non-essential, as some “accessory” genes actually represent divergent variants of an essential gene.

2 Mechanisms of Pangenome Variation

The content and diversity of a bacterium’s accessory genome are directly associated with the mode and frequency of horizontal gene transfer (HGT), which in turn is tightly linked to ecology. Modes of HGT include transformation, the uptake and integration of exogenous DNA from the environment; transduction, the introduction of exogenous DNA into the bacterial cell through a viral vector (e.g., bacteriophage); and conjugation, the direct transfer of DNA between two bacterial cells through a pilus, which usually involves plasmids and transposons. Bacteria vary in the degree to which each of these mechanisms occurs within their populations and in their DNA uptake mechanisms. It is also almost certain that other variants of these mechanisms remain to be discovered, as illustrated by recent work describing “lateral transduction” capable of transferring genomic regions of remarkable size (Chen et al. 2018).

Integrative and conjugative elements (ICE) include integrative plasmids and conjugative transposons, which are circularized mobile elements transferred through conjugation. ICE may harbor a number of genes important to virulence, specialized metabolism, and survival, and are the primary means by which antibiotic resistance genes are transmitted among bacteria. Plasmids may contain anywhere from 5 to 100 or so genes, allowing a lineage to gain or lose many loci in a single step, especially in species with high plasmid diversity. Members of the phylum Proteobacteria, which includes several pathogenic species from the genera Escherichia, Salmonella, Vibrio, Helicobacter, and Yersinia and the order Legionellales, possess some of the most prevalent and diverse plasmids with a wide host range (Shintani et al. 2015). Unsurprisingly, therefore, species in these groups have moderately large pangenome sizes (McInerney et al. 2017).

Naturally competent (transformable) species are able to take up DNA directly from the environment, resulting in homologous or nonhomologous recombination, the latter frequently associated with gene gain (Croucher et al. 2012). Arguably the most famous of these species, Streptococcus pneumoniae, was made so by its role in the Griffith experiments in 1928, which led to the identification of DNA as the conveyor of genetic information. Through those experiments, Griffith observed that “rough” (i.e., unencapsulated) avirulent S. pneumoniae could become virulent through exposure to heat-killed virulent “smooth” (i.e., encapsulated) pneumococci (Griffith 1928). We now know that what he observed was transformation resulting in the acquisition of the capsular polysaccharide (CPS) locus, which encodes the genes responsible for the synthesis and polymerization of the antigenic serotype capsule. There are over 90 serotypes identified and the CPS loci span 10,337–30,298 bp with at least 26 coding sequences depending on the particular serotype (Bentley et al. 2006). Therefore, a single recombination event of this kind can result in the acquisition of 26 or more accessory genes. Since then, other species including Neisseria gonorrhoeae, Campylobacter jejuni, Vibrio cholerae, and Haemophilus influenzae have been found to be naturally competent.

Another way in which transformation may result in differences in gene content is through events that lead to gene diversification, which are frequently observed among several species as recombination “hotspots.” The primary effect of these events is antigenic variation in genes linked to host–pathogen interactions. For example, among pneumococci, two virulence factors, pneumococcal surface proteins A and C (pspA and pspC), are known to have 3 and 11 variants, respectively (Hollingshead et al. 2000; Iannelli et al. 2002). These variants are diverse in length, structural organization, and nucleotide variation, the results of frequent recombination events. Most importantly, they differ in serology, which has significant implications for host immunity (Azarian et al. 2016; Georgieva et al. 2018). Similarly, among gonococci, the opa and neighboring pil loci are highly mosaic due to recombination of existing alleles (Bilek et al. 2009). The gene product Opa is an outer membrane adhesion protein that is important for colonization and invasion of the genital and nasopharyngeal mucosal epithelium. As a note, antigenic variation through recombination leads to an interesting contradiction in terminology. In both of these examples, pspA, pspC, opa, and pil are considered “core” genes in the sense that each member of their respective species possesses a variant. They are by all definitions “essential” to core cell function; yet, under current methods of pangenome analysis, which commonly cluster genes at a nucleotide identity threshold of at least 80%, they are identified operationally as accessory genes.

Finally, transduction through temperate bacteriophages may introduce considerable gene variation in both Gram-negative and Gram-positive bacteria (Feng et al. 2008; Waldor and Friedman 2005). While the precise evolutionary impact of these phages remains unclear in most cases, it is certain that they play a significant role in the biology and pathogenesis of their host. For example, many phages harbor genes coding for virulence factors including toxins or secreted enzymes (Romero et al. 2009); therefore, prophages (bacteriophages integrated into host bacterial genomes) represent a significant mechanism for variation of virulence among closely related bacteria (Fortier and Sekulovic 2013). In relation to pangenome dynamics, the transmission of bacteriophages can result in significant variation among bacterial populations on short timescales by two mechanisms: (1) the direct integration of the prophage and (2) the acquisition or evolution of antiphage mechanisms. The latter may involve phage-inducible chromosomal islands and CRISPR-Cas systems (Reyes-Robles et al. 2018), which independently represent instances of gene acquisition and a source of pangenome variation. Predator–prey dynamics of bacteriophages and their hosts have been widely observed with Siphoviridae phages and S. pneumoniae (Romero et al. 2009), lambda STX-coding phage in Shiga toxin-producing E. coli, and ICP (Myoviridae) and CTX phages in Vibrio cholerae (Seed et al. 2011; Waldor and Friedman 2005), among myriad others. The result is highly variable prophage content even within closely related members of bacterial lineages (Croucher et al. 2014).

3 Population Genomics of Pangenomes

Today, the identification of a bacterial sample’s core genome is a common intermediate step in bioinformatics pipelines that prepare whole-genome sequencing data for phylogenetic analysis. Historically, the accessory genome was largely ignored, with the exception of the identification of important genes such as those conferring antibiotic resistance or increased virulence. Methodologically, it was difficult to scale accessory genome analysis to large population samples of a species and especially across several species. Then, the discovery that fewer than 40% of the genes in three diverse E. coli isolates were found in the genomes of all three demonstrated that extensive variation was possible (Welch et al. 2002). A subsequent study of just eight genomes of Streptococcus agalactiae (Group B Streptococcus) published in 2005 identified 1806 core genes and 439 “dispensable” genes, highlighting that tremendous variation could be observed with even a small sample (Tettelin et al. 2005). This study introduced the concept of the pangenome. Now, large-scale analyses of pangenomes continue to reveal significant diversity even over short timescales, providing information about the demographic history and adaptive evolution of bacteria. These studies have shown that pangenome size and diversity vary among species and depend on lifestyle (McInerney et al. 2017; Ochman and Davalos 2006).

McInerney and colleagues recently summarized the range of diversity observed among bacterial species (McInerney et al. 2017). Pangenome sizes ranged from 974 for the obligate intracellular bacterium Chlamydia trachomatis to 40,362 for the semiaquatic agricultural Oryza sativa. Comparing the size of the accessory genome with the total number of genes in the pangenome, O. sativa had the smallest accessory fraction, with just 8% of genes being accessory, while in Salmonella enterica a staggering 83% of its 10,267 genes are found in the accessory genome. Assessing “genomic fluidity” is another method for quantifying pangenome diversity (Kislyuk et al. 2011). Instead of assessing the relationship between core and accessory genome size, genomic fluidity measures the dissimilarity of genomes at the gene level, calculated as the “ratio of unique gene families to the sum of gene families in pairs of genomes averaged over randomly chosen genome pairs from within a group of N genomes.” In a comparison of genomic fluidity among seven species known to undergo HGT, Neisseria meningitidis, Escherichia coli, and Streptococcus spp. ranked highest (Kislyuk et al. 2011), although it should be noted that this metric is expected to be affected by the sample chosen for study.
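The fluidity definition quoted above can be made concrete with a small sketch. The code below is illustrative only: it assumes each genome has already been reduced to a set of gene family identifiers, it averages over all genome pairs rather than over randomly chosen pairs, and the function name `genomic_fluidity` and the toy isolates are hypothetical.

```python
from itertools import combinations


def genomic_fluidity(genomes):
    """Mean over genome pairs of (unique gene families) / (sum of gene families).

    `genomes` maps an isolate name to the set of gene family identifiers it
    carries. Averaging over all pairs is an assumption made for simplicity;
    the quoted definition averages over randomly chosen pairs.
    """
    names = list(genomes)
    if len(names) < 2:
        raise ValueError("need at least two genomes")
    ratios = []
    for a, b in combinations(names, 2):
        ga, gb = genomes[a], genomes[b]
        unique = len(ga ^ gb)      # families present in exactly one genome of the pair
        total = len(ga) + len(gb)  # sum of the family counts of both genomes
        if total:                  # skip a pair of empty genomes
            ratios.append(unique / total)
    return sum(ratios) / len(ratios)


# Hypothetical toy data: three isolates sharing three core families.
toy_genomes = {
    "isolate_1": {"dnaA", "gyrB", "recA", "tetM"},
    "isolate_2": {"dnaA", "gyrB", "recA", "cps4A"},
    "isolate_3": {"dnaA", "gyrB", "recA"},
}
fluidity = genomic_fluidity(toy_genomes)
print(f"genomic fluidity = {fluidity:.3f}")
```

Because the value is an average over pairs, adding or removing a divergent isolate changes it, which is one way to see why the metric depends on the sample chosen for study.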

Within a species, accessory genome diversity increases with core genome divergence and models of homologous recombination and HGT have shown how these processes lead to the formation of population structure (Croucher et al. 2014; Marttinen et al. 2015). Boundaries for HGT across species roughly follow the same trajectories. Species in genera Streptococcus, Neisseria, and Campylobacter, for example, have been shown to engage in HGT more frequently with closely related members (e.g., between S. pneumoniae and S. mitis and S. oralis, and N. gonorrhoeae and N. meningitidis). Therefore, the size and distribution of accessory genes in a population provide insights into the demographic history of bacterial species as well as delineations of species boundaries.

As we have described, many mechanisms can generate accessory genome diversity. While not wholly analogous to the way nucleotide mutations arise and propagate in a population, the gain and loss of genes nonetheless inform the shared evolutionary history of a population in the same manner. Genomic islands acquired through HGT often become relatively fixed in bacterial lineages (Croucher et al. 2014) with the number of acquired genes increasing with lineage age (Donati et al. 2010). This is especially true for Staphylococcal Cassette Chromosome mec (SCCmec) elements in clones of S. aureus (International Working Group on the Classification of Staphylococcal Cassette Chromosome Elements (IWG-SCC) 2009), pathogenicity islands among toxigenic and non-toxigenic lineages of V. cholerae (Wozniak et al. 2009), and CPS loci in pneumococci (Bentley et al. 2006). These mobile elements, therefore, inform long-scale evolutionary history, while in the short term, prophage variation and the scars of transformation events reflect more recent events. As such, it is possible to recapitulate the core genome phylogeny of a population through phylogenetic reconstruction using a presence–absence alignment of accessory genes, with presence and absence represented by 1s and 0s, respectively (Azarian et al. 2018). In essence, this represents a tight linkage between core genome single nucleotide polymorphisms and the history of gene gain and loss. This may, of course, oversimplify the complex interconnected processes that led to accessory gene variation, but it does provide an easy data structure that may be investigated to understand how bacterial populations change over time.
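As a rough illustration of the presence–absence data structure described above, the following sketch builds one string of 1s and 0s per isolate and computes pairwise Hamming distances between them. It assumes gene content is available as sets of gene names and, for simplicity, treats a gene as core only if it is present in every isolate; the function names and toy isolates are hypothetical.

```python
from itertools import combinations


def presence_absence(genomes):
    """Return (sorted accessory genes, {isolate: string of 1s and 0s}).

    A gene counts as accessory here if it is missing from at least one
    isolate. This strict definition is an assumption of the sketch.
    """
    all_genes = set().union(*genomes.values())
    core = set.intersection(*genomes.values())
    accessory = sorted(all_genes - core)
    rows = {
        name: "".join("1" if gene in genes else "0" for gene in accessory)
        for name, genes in genomes.items()
    }
    return accessory, rows


def hamming(a, b):
    """Number of accessory genes whose presence differs between two isolates."""
    return sum(x != y for x, y in zip(a, b))


# Hypothetical toy data.
toy = {
    "iso_A": {"dnaA", "recA", "tetM", "phi1"},
    "iso_B": {"dnaA", "recA", "tetM"},
    "iso_C": {"dnaA", "recA", "cpsX"},
}
accessory, rows = presence_absence(toy)
for name, row in rows.items():
    print(f">{name}\n{row}")  # FASTA-like binary alignment
for a, b in combinations(rows, 2):
    print(a, b, hamming(rows[a], rows[b]))
```

The binary strings could then be passed to any phylogenetic method that accepts binary characters, or the pairwise distances used directly for clustering.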

An interesting approach to assessing temporal changes in bacterial population genomics is to consider the dynamics of the accessory genome. The clearest examples of this are observations of rapid changes in virulence or antibiotic resistance among bacterial lineages, often leading to short-term success of a clone (Croucher et al. 2014). Human interventions, namely vaccines, affect not only the distribution of lineages in a population but also the available pool of accessory genes. For example, if an ICE is strongly associated with a lineage, and that lineage is targeted by a vaccine, then the removal of the lineage from the population may ultimately remove the reservoir for that ICE. The impact of vaccination on the pathogen population of S. pneumoniae has been extensively studied (Azarian et al. 2018; Croucher et al. 2013). After the introduction of the seven-valent pneumococcal conjugate vaccine (PCV7) in the USA, an analysis of a sample of 616 genomes of pneumococci carried by children in Massachusetts showed the removal of accessory genes associated with the CPS loci of vaccine serotypes (Croucher et al. 2013). In addition, the prevalence of antibiotic resistance genes associated with two transposons shifted due to the removal of the two vaccine lineages with which they were associated and the subsequent emergence of a non-vaccine lineage harboring one of the transposons. A study of pneumococcal population dynamics over 13 years, spanning the introduction of PCV7, showed that the introduction of vaccines greatly shifted the frequencies of accessory genes in the population (Azarian et al. 2018). Surprisingly, the frequencies of accessory genes then shifted back to pre-vaccine values as the pneumococcal population recovered from the removal of the vaccine-targeted genotypes, which had made up nearly 30% of the population.
This observation was elucidated by recent work by Corander and colleagues, who investigated accessory gene frequencies across 4127 pneumococcal isolates from four distinct geographic areas (Corander et al. 2017). They found that accessory genes had similar frequencies in the four populations despite significant differences in lineage composition and the timing of vaccine use. Through functional analysis of the accessory genes and population dynamic modeling, they proposed that the frequencies of accessory genes are shaped by negative frequency-dependent selection (NFDS) through pathogen–pathogen, host–pathogen, and pathogen–environment interactions. Classically defined, in an NFDS model the fitness of a phenotype decreases as it becomes more common relative to other phenotypes in a population, so that rare phenotypes are favored. The same NFDS model has been used to explain the diversity of protein antigens among pneumococci, which we briefly touched upon earlier in the chapter. In the case of protein antigens, increasing host immunity toward an antigen drives diversification of the gene coding for the protein either through mutation or, most often, recombination. The same dynamic can be observed with prophages and restriction modification systems that defend against infection. Ultimately, these observations point to a central hypothesis for accessory genome variation: that differences in gene content are linked to adaptation and niche specialization, but that in the case of NFDS the niche may be dynamically generated by fluctuating frequencies of loci in the pangenome.
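The intuition that accessory gene frequencies can return toward earlier values after a lineage is removed can be explored with a deliberately simple, deterministic toy model. It is a sketch under stated assumptions, not the model of Corander et al. (2017): each strain is a fixed set of accessory loci, each locus has an assumed equilibrium frequency (set here to its pre-vaccine frequency), and a strain's relative fitness rises when the loci it carries are below equilibrium. The functional form, the parameter `sigma`, and all strain data are hypothetical.

```python
def locus_frequencies(strain_freqs, genotypes):
    """Population frequency of each locus, given strain frequencies."""
    n_loci = len(genotypes[0])
    return [
        sum(x * g[locus] for x, g in zip(strain_freqs, genotypes))
        for locus in range(n_loci)
    ]


def nfds_step(strain_freqs, genotypes, equilibrium, sigma=0.1):
    """One generation: strains carrying under-represented loci gain frequency."""
    current = locus_frequencies(strain_freqs, genotypes)
    weights = []
    for x, g in zip(strain_freqs, genotypes):
        # Positive score: the strain's loci are rarer than their equilibrium.
        score = sum(gl * (e - f) for gl, e, f in zip(g, equilibrium, current))
        weights.append(x * (1.0 + sigma) ** score)  # assumed fitness form
    total = sum(weights)
    return [w / total for w in weights]


def distance(freqs, equilibrium):
    """Euclidean distance between locus frequencies and their equilibria."""
    return sum((f - e) ** 2 for f, e in zip(freqs, equilibrium)) ** 0.5


# Hypothetical strains (rows) carrying three accessory loci (columns).
genotypes = [(1, 0, 1), (0, 1, 1), (1, 1, 0), (0, 0, 1)]
pre_vaccine = [0.4, 0.3, 0.2, 0.1]
equilibrium = locus_frequencies(pre_vaccine, genotypes)

# "Vaccination" removes the first strain; the rest are renormalized.
post = [0.0] + pre_vaccine[1:]
post = [x / sum(post) for x in post]
start = distance(locus_frequencies(post, genotypes), equilibrium)

strains = post
for _ in range(2000):
    strains = nfds_step(strains, genotypes, equilibrium)
end = distance(locus_frequencies(strains, genotypes), equilibrium)
print(f"distance from equilibrium: {start:.3f} -> {end:.3f}")
```

In this toy setting the removed strain never returns, yet the remaining strains rearrange so that locus frequencies move back toward their equilibrium values, which is the qualitative pattern described above. How closely they can return depends on which gene combinations remain in the population.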

4 The Ecological Significance of Pangenomes

The observation of pangenomes as a common feature of many bacteria raises the question of what has selected for them. What are the ecological features that lead to the pervasive association of a core with a disseminated complement of many additional genes, some shared with other species? While some of these genes have clear selective consequences, the roles of most are obscure. The extent to which bacteria vary in gene content sets them apart from eukaryotes, and is just one of the reasons we cannot easily transfer population genetic concepts between the superkingdoms of life. One metaphor for bacteria and their varying genome content compares them to modern smartphones (Young 2016), in which the core genome is the operating system, the accessory genome is the apps downloaded to the phone, and the pangenome is everything in the app store. In the following, we divide the accessory genes that combine to make up a pangenome into various categories, not by function but by how they are distributed among lineages in the population.

The perspective we take is of the bacterial genome as a transient construct. Loci can be added to it, and selected to become more common or indeed lost from the population, should they no longer be necessary. The pangenome for any sample is the totality of genes currently associated with its contents. This need not be a permanent or even especially long relationship. Consider a locally prominent prophage, which might not be present in the same population if you returned at a later date. Indeed, we can imagine that given the many ways bacteria engage in HGT, a sample of sufficient size will contain many loci in a new genetic background that are yet to be lost (analogous to incomplete purifying selection (Rocha et al. 2006)). A subset of the pangenome, expected to be rare in any reasonably large sample, consists of genes that are either infrequently obtained or actively selected against. In general, the extent of gene flow will be regulated by the genetic and ecological similarity of the bacteria and the compatibility of the genetic background to adapt to the acquisition of novel genes (Wiedenbeck and Cohan 2011).

Moving to loci that are present at intermediate frequencies, say between 5% and 95% of isolates, we can distinguish between loci that are restricted to a few lineages and loci that are widely disseminated but not fixed in any lineage. These suggest different evolutionary scenarios. Dealing with the latter first: a locus that is easy to obtain but hard to hold onto suggests fluctuating selection. We see it more often than the genes in the previous category because it provides selective benefits. However, these benefits are not consistent, or we would expect the gene to rapidly become more common and indeed part of the core. Examples include drug resistance genes in lineages that lack compensatory mutations, and as such only experience a selective benefit in the presence of the drug (Blanquart et al. 2018; Cobey et al. 2017; Lehtinen et al. 2017).

In contrast, loci fixed in a lineage might represent the ecological “address” of those bacteria, a dimension of their niche. However, this need not be the case. Studies of populations of S. pneumoniae have shown that the accessory loci in this species are not widely disseminated, but are also rarely restricted to a single lineage and are instead shared among several, in different combinations (Croucher et al. 2014). It has been suggested that different combinations of accessory loci might be selected in different populations, depending on the overall frequencies of the individual genes, as a result of negative frequency-dependent selection (Corander et al. 2017). At present, this remains a hypothesis without definitive proof.
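The categories discussed in this section can be expressed as a simple classification over a presence–absence table. The sketch below uses the 5% and 95% bounds mentioned above; the rule for calling an intermediate-frequency locus "lineage-restricted" (present in every member of each lineage where it occurs, absent elsewhere) is an assumption made for illustration, as are the function name, labels, and toy data.

```python
from collections import defaultdict


def classify_loci(presence, lineage_of, low=0.05, high=0.95):
    """Label each gene by its frequency and its distribution among lineages.

    presence:   {gene: set of isolates carrying the gene}
    lineage_of: {isolate: lineage label}
    """
    n_isolates = len(lineage_of)
    members = defaultdict(set)
    for isolate, lineage in lineage_of.items():
        members[lineage].add(isolate)

    labels = {}
    for gene, carriers in presence.items():
        freq = len(carriers) / n_isolates
        if freq > high:
            labels[gene] = "core-like"
        elif freq < low:
            labels[gene] = "rare"
        else:
            per_lineage = [len(carriers & m) / len(m) for m in members.values()]
            # Assumed rule: fixed where present, absent everywhere else.
            if all(p in (0.0, 1.0) for p in per_lineage):
                labels[gene] = "intermediate, lineage-restricted"
            else:
                labels[gene] = "intermediate, disseminated"
    return labels


# Hypothetical sample: four lineages of ten isolates each.
lineage_of = {f"L{k}_{i}": f"L{k}" for k in range(1, 5) for i in range(10)}
presence = {
    "core1": set(lineage_of),
    "island": {iso for iso, lin in lineage_of.items() if lin == "L1"},
    "resist": {f"L{k}_{i}" for k in range(1, 5) for i in range(2)},
    "odd": {"L3_7"},
}
labels = classify_loci(presence, lineage_of)
for gene, label in labels.items():
    print(gene, "->", label)
```

Real populations rarely fall so cleanly into these bins; as noted above for S. pneumoniae, many loci are shared among several lineages in different combinations, which a stricter or looser fixation rule would classify differently.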

We must also recognize that a locus might have no wider ecological significance whatsoever. Toxin–antitoxin genes can drive their own acquisition and maintenance, to say nothing of the multitudes of transposable elements, prophage and the like (Wozniak and Waldor 2009). Bacterial genomes are characterized not only by their variable gene content, and the transience of the associations between loci (long for core genes, short for others) but by the divergent selective processes affecting them. In some cases (the core) these are aligned, while in others they are not. Population geneticists who study sexually reproducing eukaryotes are familiar with the notion that the selective interests of different loci in the same genome may differ. The shuffling of genetic information in each generation effectively uncouples the association between all but the closest loci, but even the most frequently recombining bacteria (Arnold et al. 2018) do not approach the state of sexually reproducing eukaryotes. As a result, the overall fitness of a bacterial genome is the product of all the loci making it up. To preserve this overall fitness, it has been proposed that homologous recombination in bacteria is an adaptation to prevent the colonization of the genome with selfish genetic elements, by rapidly replacing them with the homologous region in the ancestral strain, which lacks the additional gene (although this does not explain the notable variation among bacteria in their recombination rates) (Croucher et al. 2016). One of the greatest challenges in providing a satisfying account of bacterial population genetics has been separating the patterns that are the result of selection, from those of linkage.

The question of how the individual loci that make up the accessory or dispensable part of the pangenome associate themselves with the lineages that are defined by the core component has come under increasing scrutiny as the number of population genomic samples has increased. Population genetic models for the core genome that are specifically developed with bacteria in mind, and capable of handling varying amounts of homologous recombination, are not common. Rarer still are models that explicitly consider the gain and loss of genes from the accessory genome. Although gain and loss of loci is not unknown in eukaryotes, and has been implicated in some major adaptive events (McInerney 2017; Schönknecht et al. 2014), it is nowhere near as extensive and does not have anything like the impact it does in bacteria. Accurate models for such processes are crucial to detect departures from neutrality, and several studies have actually found apparently neutral associations between elements of the pangenome and the core. However, there are reasons to think that the sequence variation associated with the accessory genome may produce fundamentally different results from those in population genetics textbooks. For example, if the site frequency spectrum expected under neutral assumptions is extended to allow mutations in loci that can be gained or lost, systematic bias results (Baumdicker 2015; Baumdicker et al. 2012; Collins and Higgs 2012).

Given that the accessory fraction of the pangenome is enriched for loci involved in properties ranging from toxin production to restriction-modification systems and surface antigens, to say nothing of drug resistance genes, it is hard to imagine that it might fit well to a neutral model. In several cases, though, it does (Baumdicker et al. 2012; Marttinen and Hanage 2017). This result is hard to accept, and so it should be, given all we know about the power of selection and the size of bacterial populations. However, it should be appreciated that a multitude of selective scenarios can produce a signal that is hard or impossible to distinguish from neutrality. Study of other metrics may be required to unveil the underlying processes. For instance, the rates at which diverging strains of pneumococcus acquired or lost genes were found to be indistinguishable from neutrality and even to yield good estimates of the population mutation and recombination rates (Marttinen and Hanage 2017). Yet later analysis of the same population, alongside others, was interpreted as strong evidence for negative frequency-dependent selection on the accessory fraction of the genome (see above). What is going on?

A possible explanation lies in the central limit theorem. If an outcome is determined by many independent random variables, each with finite variance, then we expect the result of adding them all together to be approximately normally distributed. In other words, if the fitness of a strain is the consequence of many independent factors, we might find it appears neutral: the chances of any individual getting into the next generation could be normally distributed around 50:50. This result has been the source of substantial interest in ecology, given that it can be used to show that species abundance distributions (SADs, a common metric for summarizing ecological diversity (McGill et al. 2007)) can appear neutral while actually being the result of many non-neutral processes. In the case of bacteria, the fitness effects of genes on the same mobile element may not be independent; however, the effects of multiple mobile elements may similarly sum to an overall strain fitness that is not distinguishable from neutrality. Other models from community ecology may be useful in determining the contents of genomes, as well as ecosystems.
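The central limit argument can be illustrated with a short simulation. In this sketch each factor's fitness effect is drawn independently from a strongly skewed distribution (a centered exponential, chosen purely for illustration), and the skewness of the summed effect is compared for strains influenced by one factor versus many. All names and parameter values are hypothetical.

```python
import random
import statistics


def skewness(values):
    """Sample skewness: close to 0 for a symmetric, normal-like distribution."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def summed_effects(n_factors, n_strains, rng):
    """Total fitness effect per strain as a sum of independent skewed effects."""
    return [
        sum(rng.expovariate(1.0) - 1.0 for _ in range(n_factors))
        for _ in range(n_strains)
    ]


rng = random.Random(1)  # fixed seed so the sketch is reproducible
one_factor = summed_effects(1, 5000, rng)
many_factors = summed_effects(200, 5000, rng)

print(f"skewness with 1 factor:    {skewness(one_factor):.2f}")
print(f"skewness with 200 factors: {skewness(many_factors):.2f}")
```

With a single factor the distribution of effects is clearly asymmetric, whereas the sum over many independent factors is close to symmetric and bell-shaped even though every individual factor is non-neutral in the sense of having a skewed effect. The independence assumption is doing the work here, which is why linked genes on the same mobile element are a caveat.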

Nevertheless, the current consensus in the field is that gene variation directly reflects the ecological niche occupied by the bacteria (Sheppard et al. 2018) and the response to local selective pressures (Cordero and Polz 2014). This may involve the acquisition of antibiotic resistance genes, as described above, metabolic genes needed to exploit a novel energy source, bacteriocins for microbial warfare, or phage and phage-defense genes involved in predator–prey “rock-paper-scissors” dynamics, as so eloquently described by Corander and colleagues (Arnold et al. 2018). Further, it is suspected that rapid acquisition and dissemination of genes most often occurs as bacterial clones adapt to a novel niche previously occupied by another species (Polz et al. 2013; Popa et al. 2011; Smillie et al. 2011; Vos et al. 2015). One example is the acquisition of an IncA/C plasmid by Vibrio cholerae introduced to Haiti, a country previously devoid of epidemic cholera for at least 100 years (Carraro et al. 2016). Another is the post-vaccine population of S. pneumoniae in the USA, which experienced a significant population shift after the 7-valent pneumococcal conjugate vaccine removed approximately 30% of the pre-vaccine population. Niches themselves are not explicitly segregated, and therefore one does not have to be vacated to then be exploited by a newcomer. Gene flow may occur between sympatric lineages; i.e., habitat borders are not defined by walls or other barriers, and recombination can occur among lineages of a species where habitat space is not clearly demarcated (Marttinen and Hanage 2017). This model explains lineage divergence and population structure among several species, and is important because it highlights that a species requires not only the ability to acquire genes but also the opportunity to do so.
Interestingly, it has also been suggested that once a competent species encounters a new niche, it can give rise to noncompetent lineages, providing an advantage when adaptation through gene acquisition is not required and may, in fact, be deleterious (Jorth and Whiteley 2012).

The acquisition of genes is not always beneficial and may, in fact, be deleterious (Vos et al. 2015). Indeed, for every successful lineage that is observed, there are likely several “failed” ecological experiments. Since there is not a clear delineation between the fitness gain and costs of gene acquisition, it may be an oversimplification of the dynamics to ascribe a net-positive or net-negative effect of gene gain and loss. The truth, of course, is somewhere in between and likely varies to different degrees between species. To offset fitness costs and compensate for the acquisition of mobile elements, mutations may arise in core loci and form epistatic relationships with the acquired gene. This has been suggested, for example, in E. coli, where nucleotide substitutions in regulatory genes were found to be associated with the acquisition and maintenance of accessory genes (McNally et al. 2016). This dynamic is further supported, in part, by recent findings of epistatic interactions across genome-wide loci among multiple bacterial species (Arnold et al. 2018; Skwark et al. 2017).

There are examples where ecological niches are clearly defined among species and others where the relationships between habitat and organism are obscured. In the E. coli study (McNally et al. 2016), ecological adaptation and niche segregation were not observed among isolates collected from humans and animals, while in other species such as Campylobacter, this is commonly observed (Sheppard et al. 2011). Methods to investigate gene flow and selection in the context of adaptation and ecology are continually being refined. In some instances, identifying the appropriate system to test ecological hypotheses is the limiting step. An intriguing approach to understanding these associations is not to identify niches and the organisms that inhabit them and then attempt to resolve the genes associated with adaptation, but instead to first assess gene flow and then make predictions about ecology. So-called “reverse ecology,” proposed by Shapiro and Polz, seeks to investigate habitat specificity by assessing gene flow and gene-specific sweeps, and has been used to predict ecological differentiation of Vibrio spp. in aquatic environments (Hunt et al. 2008; Shapiro et al. 2012; Shapiro and Polz 2014). These studies demonstrate how applying a fresh perspective to an appropriate model system can advance our understanding of bacterial ecology.

Taken together, the accumulation of population samples that have been analyzed with modern genomic methods has greatly improved our understanding of the pangenome and its ecological significance. The totality of loci in a sample includes the essential core, together with a set of accessory loci that have a range of ecological and evolutionary significance: from functional genes with direct relevance to niche, such as those described in the reverse ecology approach of Shapiro and Polz, to more selfish elements such as toxin–antitoxin systems. One feature of the current landscape of bacterial genomics that is not often noticed is that, for all the references in the literature to “Whole Genome Sequencing,” few studies actually determine the whole, i.e., finished, genome including all plasmids. Our current understanding is overwhelmingly based on high-quality draft, not finished, genomes. The emergence of long-read technologies is changing this. As they improve and become more economical (together with more methods for making hybrid assemblies from short- and long-read data), we may find that our current understanding underestimates the actual quantities of sequence variation in bacteria, and that there are short regions under strong selection that accumulate rapid change and are hence hard to assemble from short-read data. Adding these is just one of the exciting directions for research over the next few years, which is sure to improve our understanding of pangenomes and their significance far beyond our current knowledge.