Genome-wide mapping and assembly of structural variant breakpoints in the mouse genome
- Aaron R. Quinlan (1),
- Royden A. Clark (1),
- Svetlana Sokolova (1),
- Mitchell L. Leibowitz (1),
- Yujun Zhang (2),
- Matthew E. Hurles (2),
- Joshua C. Mell (3) and
- Ira M. Hall (1,4,5)
- 1 Department of Biochemistry and Molecular Genetics, University of Virginia School of Medicine, Charlottesville, Virginia 22908, USA;
- 2 Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, United Kingdom;
- 3 Department of Zoology, University of British Columbia, Vancouver, British Columbia V6T 3Z4, Canada;
- 4 Center for Public Health Genomics, University of Virginia, Charlottesville, Virginia 22908, USA
Abstract
Structural variation (SV) is a rich source of genetic diversity in mammals, but due to the challenges associated with mapping SV in complex genomes, basic questions regarding its genomic distribution and mechanistic origins remain unanswered. We have developed an algorithm (HYDRA) to localize SV breakpoints by paired-end mapping, and a general approach for the genome-wide assembly and interpretation of breakpoint sequences. We applied these methods to two inbred mouse strains: C57BL/6J and DBA/2J. We demonstrate that HYDRA accurately maps diverse classes of SV, including those involving repetitive elements such as transposons and segmental duplications; however, our analysis of the C57BL/6J reference strain shows that incomplete reference genome assemblies are a major source of noise. We report 7196 SVs between the two strains, more than two-thirds of which are due to transposon insertions. Of the remainder, 59% are deletions (relative to the reference), 26% are insertions of unlinked DNA, 9% are tandem duplications, and 6% are inversions. To investigate the origins of SV, we characterized 3316 breakpoint sequences at single-nucleotide resolution. We find that ∼16% of non-transposon SVs have complex breakpoint patterns consistent with template switching during DNA replication or repair, and that this process appears to preferentially generate certain classes of complex variants. Moreover, we find that SVs are significantly enriched in regions of segmental duplication, but that this effect is largely independent of DNA sequence homology and thus cannot be explained by non-allelic homologous recombination (NAHR) alone. This result suggests that the genetic instability of such regions is often the cause rather than the consequence of duplicated genomic architecture.
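As an illustration of the general idea behind paired-end mapping, the sketch below labels a single read pair by the orientation and spacing of its two alignments. It is not the HYDRA algorithm and makes no claim about how HYDRA works internally. It assumes an inward-facing (forward/reverse) paired-end library, and the names `PairedMapping` and `classify_pair`, along with the `expected_span` and `tolerance` values, are hypothetical choices made for this example only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairedMapping:
    """One read pair aligned to a reference (hypothetical record layout)."""
    chrom1: str
    pos1: int      # leftmost reference coordinate of the first end
    strand1: str   # "+" or "-"
    chrom2: str
    pos2: int      # leftmost reference coordinate of the second end
    strand2: str   # "+" or "-"


def classify_pair(pair, expected_span=500, tolerance=100):
    """Return a coarse label for one pair.

    Assumes an inward-facing library, so a concordant pair maps as
    "+" on the left and "-" on the right, roughly expected_span apart.
    Both numeric defaults are illustrative, not values from the paper.
    """
    # Ends on different chromosomes cannot be explained by a local event.
    if pair.chrom1 != pair.chrom2:
        return "unlinked"

    # Order the two ends by coordinate so mate order does not matter.
    (left_pos, left_strand), (right_pos, right_strand) = sorted(
        [(pair.pos1, pair.strand1), (pair.pos2, pair.strand2)]
    )

    # Both ends on the same strand suggest an inversion breakpoint.
    if left_strand == right_strand:
        return "inversion"

    # Outward-facing ends ("-" left, "+" right) suggest a tandem duplication.
    if left_strand == "-" and right_strand == "+":
        return "tandem_duplication"

    # Correct orientation: compare the spacing with the expected spacing.
    # The span is simplified to the distance between leftmost coordinates.
    span = right_pos - left_pos
    if span > expected_span + tolerance:
        return "deletion"      # ends map farther apart than expected
    if span < expected_span - tolerance:
        return "insertion"     # ends map closer together than expected
    return "concordant"
```

For example, `classify_pair(PairedMapping("chr1", 1000, "+", "chr1", 5000, "-"))` would be labeled a deletion relative to the reference under these assumed defaults. A practical caller would also need to cluster many supporting pairs and handle reads with multiple mappings, which this sketch does not attempt.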
Footnotes
- 5 Corresponding author. E-mail irahall{at}virginia.edu.
- [Supplemental material is available online at http://www.genome.org. The sequence data generated for this study have been submitted to the Short Read Archive (http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi) under accession no. SRA010027. Structural variant calls have been submitted to dbVAR (http://www.ncbi.nlm.nih.gov/projects/dbvar/) under accession no. nsdt19. Source code for the HYDRA algorithm is available at http://code.google.com/p/hydra-sv/.]
- Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.102970.109.
- Received November 5, 2009.
- Accepted March 9, 2010.
- Copyright © 2010 by Cold Spring Harbor Laboratory Press