FLASH: fast length adjustment of short reads to improve genome assemblies

Bioinformatics - Volume 27, Issue 21 - Pages 2957-2963 - 2011
Tanja Magoč¹, Steven L. Salzberg¹
¹McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA

Abstract

Motivation: Next-generation sequencing technologies generate very large numbers of short reads. Even with very deep genome coverage, short read lengths cause problems in de novo assemblies. The use of paired-end libraries with a fragment size shorter than twice the read length provides an opportunity to generate much longer reads by overlapping and merging read pairs before assembling a genome.
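The overlap condition above can be illustrated with a minimal Python sketch. It is not the FLASH implementation (which is written in C) and is not taken from the paper; the names `expected_overlap` and `merge_pair`, the parameters `min_overlap` and `max_mismatch_ratio`, and their default values are all hypothetical. The sketch assumes the second read is given as sequenced (from the opposite strand), scores every candidate overlap by its mismatch ratio, and keeps the read 1 base wherever the two reads disagree.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq))


def expected_overlap(read_len, fragment_len):
    """Overlap between two reads of equal length drawn from both ends of one fragment.

    The value is positive only when the fragment is shorter than twice the read length.
    """
    return 2 * read_len - fragment_len


def merge_pair(read1, read2, min_overlap=10, max_mismatch_ratio=0.25):
    """Merge a read pair into one longer read, or return None if no overlap is acceptable.

    Assumption: thresholds and tie-breaking here are illustrative, not FLASH's.
    """
    r2 = reverse_complement(read2)  # bring read 2 onto the strand of read 1
    best = None  # (mismatch_ratio, overlap_length)
    # Try overlaps from longest to shortest; on ties the longer overlap is kept.
    for olen in range(min(len(read1), len(r2)), min_overlap - 1, -1):
        mismatches = sum(a != b for a, b in zip(read1[-olen:], r2[:olen]))
        ratio = mismatches / olen
        if best is None or ratio < best[0]:
            best = (ratio, olen)
    if best is None or best[0] > max_mismatch_ratio:
        return None
    # Keep read 1 in full and append the part of read 2 beyond the overlap.
    return read1 + r2[best[1]:]


if __name__ == "__main__":
    fragment = "ACGTTGCAAGGCTTAACCGGATCGATTACG"        # 30 bp fragment
    read1 = fragment[:20]                              # forward read, 20 bp
    read2 = reverse_complement(fragment[-20:])         # reverse read, 20 bp
    print(expected_overlap(20, 30))
    print(merge_pair(read1, read2) == fragment)
```

A real merger would also use base quality scores to resolve disagreements inside the overlap; this sketch ignores them to stay short.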

Results: We present FLASH, a fast computational tool to extend the length of short reads by overlapping paired-end reads from fragment libraries that are sufficiently short. We tested the correctness of the tool on one million simulated read pairs, and we then applied it as a pre-processor for genome assemblies of Illumina reads from the bacterium Staphylococcus aureus and human chromosome 14. FLASH correctly extended and merged reads >99% of the time on simulated reads with an error rate of <1%. With adequately set parameters, FLASH correctly merged reads over 90% of the time even when the reads contained up to 5% errors. When FLASH was used to extend reads prior to assembly, the resulting assemblies had substantially greater N50 lengths for both contigs and scaffolds.
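For readers unfamiliar with the N50 statistic mentioned above, the following Python sketch shows one common way to compute it: the length of the shortest contig (or scaffold) in the smallest set of longest sequences that together cover at least half of the total assembled length. The function name `n50` is hypothetical, and measuring against the total assembly length (rather than an estimated genome size) is an assumption; the abstract does not state which variant the authors used.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    Assumption: N50 is measured against the sum of the given lengths.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


if __name__ == "__main__":
    print(n50([80, 70, 50, 40, 30, 20]))
```

Under this definition, an assembly whose pieces are longer on average tends to have a higher N50, which is why the statistic is used to compare contiguity between assemblies of the same data.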

Availability and Implementation: The FLASH system is implemented in C and is freely available as open-source code at http://www.cbcb.umd.edu/software/flash.
