Trends in Genetics
Volume 24, Issue 3, March 2008, Pages 142-149

Review
Bioinformatics challenges of new sequencing technology

https://doi.org/10.1016/j.tig.2007.12.006

New DNA sequencing technologies can sequence up to one billion bases in a single day at low cost, putting large-scale sequencing within the reach of many scientists. Many researchers are forging ahead with projects to sequence a range of species using the new technologies. However, these new technologies produce read lengths as short as 35–40 nucleotides, posing challenges for genome assembly and annotation. Here we review the challenges and describe some of the bioinformatics systems that are being proposed to solve them. We specifically address issues arising from using these technologies in assembly projects, both de novo and for resequencing purposes, as well as efforts to improve genome annotation in the fragmented assemblies produced by short read lengths.

Section snippets

New technologies: more data and new types of data

The ongoing revolution in sequencing technology has led to the production of sequencing machines with dramatically lower costs and higher throughput than the technology of just 2 years ago. Sequencers from 454 Life Sciences/Roche, Solexa/Illumina and Applied Biosystems (SOLiD technology) are already in production, and a competing technology from Helicos should appear soon. However, the increase in the volume of raw sequence that can be produced from these sequencers is threatening to swamp our

Sequence assembly using SRS technology

The development of automated sequencing technologies has revolutionized biological research by allowing scientists to decode the genomes of many organisms. SRS technologies can accelerate the pace at which we explore the natural world, yet pose new challenges to the software tools used to reconstruct genetic information from the raw data produced by sequencing machines.
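To give a rough sense of the data volumes involved, the arithmetic below combines the two figures quoted in the abstract (about one billion bases per day, reads as short as 35 nucleotides) with a hypothetical 5 Mb genome. The genome size, the function name `coverage_stats` and the uniform-sampling idealization (the Lander-Waterman approximation, under which a base is missed by every read with probability about e^(-c) at mean coverage c) are assumptions made for illustration, not figures from the review.

```python
import math

def coverage_stats(total_bases, read_length, genome_size):
    """Return (number of reads, mean coverage, expected uncovered fraction)."""
    n_reads = total_bases // read_length
    coverage = n_reads * read_length / genome_size
    # Idealization: reads land uniformly at random, so the chance that a
    # given base is missed by every read is approximately exp(-coverage).
    uncovered = math.exp(-coverage)
    return n_reads, coverage, uncovered

n_reads, cov, gap = coverage_stats(
    total_bases=1_000_000_000,  # one day's output, as quoted in the abstract
    read_length=35,             # shortest read length quoted in the abstract
    genome_size=5_000_000,      # hypothetical bacterial genome (assumption)
)
print(f"{n_reads} reads, about {cov:.0f}x coverage, uncovered fraction {gap:.2e}")
```

Under these assumptions a single day of sequencing yields tens of millions of reads, which is why the volume of data, rather than raw coverage, becomes the practical concern for assembly software. Real data depart from the uniform model because of sequencing errors, repeats and sampling bias.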

Genome resequencing

The near completion of a reference human genome has greatly accelerated research on genetic diversity within our species. Resequencing efforts have thus far targeted individual genes or other genomic regions of interest [3], but advances in SRS technologies have opened up the possibility of whole genome resequencing. The resequencing of multiple strains of several model organisms (e.g. Drosophila melanogaster and Caenorhabditis elegans) and the large-scale resequencing of human cancers are

Assembly of closely related species – mind the gap

Genome scientists have sequenced and assembled the human genome, most model organisms, and almost all major human pathogens to high degrees of accuracy. Many of these genomes – particularly the bacterial and viral species – have been finished, meaning that all chromosomes are sequenced end-to-end with no gaps. Almost as soon as the first genome from each species was published, scientists started to make plans to sequence additional strains and isolates. The dramatically lower cost of sequencing

De novo assembly

Despite a dramatic increase in the number of complete genome sequences available in public databases, the vast majority of the biological diversity in our world remains unexplored. SRS technologies have the potential to significantly accelerate the sequencing of new organisms. De novo assembly of SRS data, however, will require the development of new software tools that can overcome the technical limitations of these technologies. An overview of genome assembly is provided in Box 1.
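As a minimal illustration of what overlap-based assembly involves, the sketch below repeatedly merges the pair of sequences with the longest exact suffix-prefix overlap. It is a toy under strong assumptions (error-free reads, a single strand, no repeats, quadratic running time) and is not the algorithm of any tool cited in this review; the function names and the `min_len` parameter are hypothetical.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a equal to a prefix of b (at least min_len), else 0."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len=3):
    """Merge the best-overlapping pair until no overlap of min_len or more remains."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best = (0, None, None)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:
            return contigs
        merged = contigs[i] + contigs[j][k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (i, j)]
        contigs.append(merged)

# Four error-free 8-nt reads tiled across a made-up 17-nt sequence.
reads = ["ATGGCGTG", "GCGTGCAA", "TGCAATCC", "AATCCGTA"]
print(greedy_assemble(reads))
```

With reads this clean the toy recovers a single contig. Shorter reads shrink the overlaps available to distinguish true neighbours from chance matches, which is one reason short-read data call for dedicated assemblers.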

Studies by

Annotation of genomes sequenced with SRS technology

The highly fragmented assemblies resulting from SRS projects present several problems for genome annotation. The use of SRS technology is so new that few methods have been published describing how current annotation methods can be adapted to account for the various types of sequencing errors that might be present in a genome sequenced with the newer technology.

We can expect that the annotation of genomes sequenced by the new technologies will be reasonably accurate for genes that are found in

Sequencing of transcripts and regulatory elements

The sequencing of transcribed gene products [expressed sequence tags (ESTs)] has long been a vital tool for the characterization of genes in the human genome and other species. EST sequencing also has an important role in the characterization of splice variants and the identification of regulatory signals in a genome—tasks that are not effectively performed through computational means alone. Transcriptome and regulome sequencing projects have been, perhaps, the most successful application of

Annotation of metagenomics projects

One of the most promising applications of SRS technologies is sequencing of environmental samples, also known as metagenomics. In these projects, DNA is purified from an environment such as soil, water or part of the human body, and the mixture of species is sequenced using a random shotgun technique. The resulting reads might originate from hundreds or even thousands of different species, presenting a much greater assembly challenge than a single genome sequencing project.

Currently,

Concluding remarks

Fifteen years of research have shown that, for DNA sequencing technology, longer is better, especially where genome assembly is involved. Someday, perhaps, we will be able to isolate a single chromosome and read it end to end, eliminating the assembly step entirely. At present, however, new short read sequencing (SRS) technologies can sequence so rapidly and so cheaply that SRS is clearly here to stay. Despite their limitations, these still-evolving technologies can replace Sanger

References (52)

  • J.O. Korbel. Paired-end mapping reveals extensive structural variation in the human genome. Science (2007)
  • M.L. Sogin. Microbial diversity in the deep sea and the underexplored ‘rare biosphere’. Proc. Natl. Acad. Sci. U. S. A. (2006)
  • S.M. Huse. Accuracy and quality of massively parallel DNA pyrosequencing. Genome Biol. (2007)
  • M.J. Moore. Rapid and accurate pyrosequencing of angiosperm plastid genomes. BMC Plant Biol. (2006)
  • D.A. Nickerson. PolyPhred: automating the detection and genotyping of single nucleotide substitutions using fluorescence-based resequencing. Nucleic Acids Res. (1997)
  • K. Chen. PolyScan: an automatic indel and SNP detection approach to the analysis of human resequencing data. Genome Res. (2007)
  • R. Koenig. Tuberculosis. Few mutations divide some drug-resistant TB strains. Science (2007)
  • N. Whiteford. An analysis of the feasibility of short read sequencing. Nucleic Acids Res. (2005)
  • M. Pop. Comparative genome assembly. Brief. Bioinform. (2004)
  • M. Chaisson. Fragment assembly with short reads. Bioinformatics (2004)
  • A. Sundquist. Whole-genome sequencing and assembly with high-throughput, short-read technologies. PLoS ONE (2007)
  • F. Poly. Genome sequence of a clinical isolate of Campylobacter jejuni from Thailand. Infect. Immun. (2007)
  • L. Krause. Finding novel genes in bacterial communities isolated from the environment. Bioinformatics (2006)
  • R.L. Warren. Assembling millions of short DNA sequences using SSAKE. Bioinformatics (2007)
  • W.R. Jeck. Extending assembly of short DNA sequences to handle error. Bioinformatics (2007)
  • J.C. Dohm. SHARCGS, a fast and highly accurate short-read assembly algorithm for de novo genomic sequencing. Genome Res. (2007)