Elsevier

Annals of Epidemiology

Volume 16, Issue 3, March 2006, Pages 157-169
Phylogenetic Analysis as a Tool in Molecular Epidemiology of Infectious Diseases

https://doi.org/10.1016/j.annepidem.2005.04.010

Phylogenetics is a powerful tool for microbial epidemiology, but it is often misused and misinterpreted in the field. Microbial epidemiologists are cautioned that a phylogenetic tree must be correctly rooted before any inferences can be drawn about the order of descent from a common ancestor. Epidemiological samples of microbial populations typically include both ancestors and their descendants. To illustrate the relationships among those isolates, the phylogenetic method used must be able to detect zero-length branches. The Unweighted Pair-Group Method with Arithmetic Mean (UPGMA) is the phylogenetic method most widely used in microbial epidemiology. Because UPGMA cannot detect zero-length branches, and because it places the root of the tree based on a usually false assumption, UPGMA is the worst possible choice among the several phylogenetic methods available. Because microbial epidemiology deals with relationships among strains within a species, rather than with relationships among species, recombination within those species can render phylogenetic trees meaningless and positively misleading. When there is evidence of significant recombination within the species of interest, phylogenetic trees should not be used at all. Instead, alternative tools such as eBURST should be used to understand relationships among isolates.

Introduction

The goal of an epidemiologic study is generally to determine the cause or source of a health-related phenomenon that affects a population, and the distribution of that phenomenon throughout the population (1). While the source of many chronic, behavioral, or noninfectious diseases can be determined by studying the attributes, behaviors, and environment of the population of interest, it is often much more difficult to track the source of an infectious disease within a population by using these methods. The difficulties in tracking the source of an infectious agent arise because the pool of infected individuals experiences turnover (2), clinical laboratories have limited resources for identifying and reporting cases (2, 3), numerous infectious diseases cause similar symptoms (4), and many infected people do not seek treatment for their infections (5).

For over 30 years, molecular epidemiology has served as an important tool for studying the spread of infectious diseases (4). Restriction fragment length polymorphisms (RFLP), randomly amplified polymorphic DNA (RAPD), and, more recently, multilocus sequence typing (MLST) have been used to determine the relatedness of bacterial strains (4). While these methods are useful for identifying whether an infectious agent has a single source or multiple sources, by themselves they are limited in their ability to identify the source of an infectious strain because they tell us nothing about the direction in which evolution has occurred. For example, if a noninfectious strain is closely related to an infectious strain, none of these methods tells us whether the infectious strain was derived from the noninfectious strain or vice versa.

Phylogenetic methods can be used to analyze nucleotide sequence data, such as those available from MLST analyses, in such a way that the order of descent of related strains can be determined. When coupled with appropriate phylogenetic analysis, molecular epidemiology has the potential to elucidate mechanisms that lead to microbial outbreaks and epidemics. Despite the utility of phylogenetics and the availability of inexpensive software and manuals for phylogenetic analyses, phylogenetic methods are often inappropriately applied. Even when appropriately applied, they are often poorly explained and are therefore poorly understood. Because phylogenetic analysis is inexpensive, especially when sequence data are already available, and because it shows much more clearly than sequence data alone how infectious agents are spreading and evolving, it is important for molecular epidemiologists to understand, correctly apply, and correctly interpret phylogenies and phylogenetic methods. This review, while not comprehensive, is intended to give molecular epidemiologists an overview of the methods, uses, and interpretations of phylogenetic trees derived from MLST data. Although we will discuss MLST data, the same discussion applies equally well to viral sequence data and to data obtained by analysis of whole-chromosome restriction fragments that have been separated by pulsed-field gel electrophoresis (PFGE).

There are several reasons why, since its introduction in 1998 (6), MLST has become the method of choice for microbial epidemiology. Because the data are DNA sequences, the method is reproducible from laboratory to laboratory. Because an MLST scheme for any particular organism involves a defined set of primers for amplifying the DNA sequences, data from different laboratories are directly comparable and can be pooled into a single, ever-expanding data set that is stored in a single database. The data can be accessed over the Internet (http://www.mlst.net/databases/default.asp), and there are currently MLST databases for 20 microbial pathogens available, with another five schemes in development. The automation of DNA sequencing and its declining cost have made MLST a practical tool for epidemiologists. Finally, MLST typically has more resolving power than earlier methods and is therefore preferable to them.

Typically, MLST data comprise partial sequences (400–600 base pairs) of several (typically 6–8) housekeeping genes that are dispersed around the bacterial chromosome. The sequence variation between two alleles of a locus is usually in the range of 0.1%–5%. Housekeeping genes are chosen because, being essential to life, they are under moderate to intense purifying selection. As a result, most of the sequence variation is the result of synonymous nucleotide substitutions that are close to selectively neutral. Because neutral variation accumulates approximately linearly with time (the molecular clock), genetic distance between alleles tends to be proportional to the time since those alleles diverged. Because epidemiologists are usually interested in identifying and tracking pathogenic clones, some of which will have evolved only recently from nonpathogenic ancestors, it is important to have sufficient sequence variation to distinguish clones from close relatives. That level of variation is achieved by sequencing 6–8 loci. The loci are chosen to be well separated and scattered roughly evenly about the chromosome in order to assess the contribution of recombination to the variation, as discussed later in the section on the importance of recombination in evaluating phylogenetic relationships.
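The sequence variation between two alleles can be illustrated with a minimal sketch. It assumes the simplest possible measure, the uncorrected proportion of differing sites between two aligned, equal-length sequences; the text above does not specify a particular distance measure, and the function name `p_distance` and the example alleles are hypothetical.

```python
def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of aligned sites at which two sequences differ.

    Assumes the sequences are already aligned and of equal length;
    no correction for multiple substitutions at a site is applied.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    differences = sum(a != b for a, b in zip(seq_a.upper(), seq_b.upper()))
    return differences / len(seq_a)


# Hypothetical 450-bp alleles of one locus that differ at 9 sites.
allele_1 = "A" * 450
allele_2 = "A" * 441 + "G" * 9
print(f"{p_distance(allele_1, allele_2):.1%}")  # prints 2.0%
```

In this made-up example the two alleles differ by 2.0%, which falls within the 0.1%–5% range described above.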

The data are analyzed by assigning to each unique sequence of a locus an allele number. The allelic profile for an individual is the series of integers that represent the allele numbers at each of the loci, and each unique allelic profile defines a sequence type (ST). Individuals that have the same ST are identical at all of the loci examined and are presumed to be clones unless other reliable data (serotype, pathogenicity, metabolic properties) distinguish them.
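The assignment of allele numbers and sequence types described above can be sketched as follows. This is an illustration only: the function name `type_isolates`, the isolate names, and the short sequences are hypothetical, and numbers are assigned here in order of first appearance, whereas a shared MLST database assigns them centrally.

```python
from typing import Dict, List, Tuple


def type_isolates(
    isolates: Dict[str, List[str]],
) -> Tuple[Dict[str, Tuple[int, ...]], Dict[str, int]]:
    """Return (allelic profile per isolate, sequence type per isolate).

    `isolates` maps an isolate name to its locus sequences, with the
    loci listed in the same order for every isolate.
    """
    n_loci = len(next(iter(isolates.values())))
    # One lookup table per locus: unique sequence -> allele number.
    allele_tables: List[Dict[str, int]] = [{} for _ in range(n_loci)]
    # Unique allelic profile -> sequence type (ST) number.
    st_table: Dict[Tuple[int, ...], int] = {}
    profiles: Dict[str, Tuple[int, ...]] = {}
    sequence_types: Dict[str, int] = {}

    for name, sequences in isolates.items():
        if len(sequences) != n_loci:
            raise ValueError(f"{name}: expected {n_loci} loci")
        # A sequence seen before keeps its number; a new one gets the next.
        profile = tuple(
            table.setdefault(seq, len(table) + 1)
            for table, seq in zip(allele_tables, sequences)
        )
        profiles[name] = profile
        sequence_types[name] = st_table.setdefault(profile, len(st_table) + 1)
    return profiles, sequence_types


# Hypothetical isolates typed at three loci (real schemes use 6-8 loci).
example = {
    "isolate_A": ["ACGT", "GGCA", "TTAG"],
    "isolate_B": ["ACGT", "GGCA", "TTAG"],
    "isolate_C": ["ACGT", "GGTA", "TTAG"],
}
profiles, sequence_types = type_isolates(example)
print(profiles["isolate_C"])  # (1, 2, 1)
print(sequence_types)  # {'isolate_A': 1, 'isolate_B': 1, 'isolate_C': 2}
```

Here isolate_A and isolate_B share an ST and would be presumed clones, while isolate_C differs at the second locus and receives a new ST.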

Application of Phylogenetic Methods to Epidemiology

For the sake of this review, let us assume that we have a sample that includes many isolates of a pathogen that is responsible for a disease outbreak. The pathogen might be a microorganism or it might be a virus. In the former case, our raw data will probably be the complete or partial sequences of several genes (MLST); in the latter case, it will be the complete or partial sequences of the viral genome. In either case, we want to know how those isolates are related to each other, how they are

Conclusion

There are three aspects of the data that distinguish molecular phylogenetics of microbial populations from the typical molecular phylogenetics of different species: (1) recombination can have a substantial impact on the validity of phylogenetic trees, (2) sequences tend to be very closely related, and (3) both an isolate and its evolutionary descendant are often present in a sample. These aspects present problems that are not often encountered in typical evolutionary studies, and both the

References (31)

  • M. Achtman. A phylogenetic perspective on molecular epidemiology.
  • B.G. Spratt et al. Displaying the relatedness among isolates of bacterial species – the eBURST approach. FEMS Microbiol Lett (2004).
  • K.J. Rothman. Epidemiology: An Introduction (2002).
  • R.H. Xu et al. Epidemiologic clues to SARS origin in China. Emerg Infect Dis (2004).
  • T.F. Jones et al. Limitations to successful investigation and reporting of foodborne outbreaks: An analysis of foodborne disease outbreaks in FoodNet catchment areas, 1998–1999. Clin Infect Dis (2004).
  • P.R. Murray et al. Manual of Clinical Microbiology (1999).
  • C.C. Tam et al. The study of infectious intestinal disease in England: What risk factors for presentation to general practice tell us about potential for selection bias in case-control studies of reported cases of diarrhea. Int J Epidemiol (2003).
  • M.C. Maiden et al. Multilocus sequence typing: A portable approach to the identification of clones within populations of pathogenic microorganisms. Proc Natl Acad Sci USA (1998).
  • S. Kumar et al. MEGA2: Molecular evolutionary genetics analysis software. Bioinformatics (2001).
  • Swofford DL. 2000 PAUP*. Phylogenetic Analysis Using Parsimony (*and Other Methods). Sinauer Associates, Sunderland,...
  • Felsenstein J 2004 PHYLIP (Phylogeny Inference Package), version 3.6. Distributed by the author, Department of Genome...
  • N. Saitou et al. The neighbor-joining method: A new method for reconstructing phylogenetic trees. Mol Biol Evol (1987).
  • E. Meats et al. Characterization of encapsulated and noncapsulated Haemophilus influenzae and determination of phylogenetic relationships by multilocus sequence typing. J Clin Microbiol (2003).
  • D.F. Feng et al. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J Mol Evol (1987).
  • J.D. Thompson et al. Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position specific gap penalties and weight matrix choice. Nucleic Acids Res (1994).