Velvet: Algorithms for de novo short read assembly using de Bruijn graphs

Genome Research - Tập 18 Số 5 - Trang 821-829 - 2008
Daniel R. Zerbino1, Ewan Birney1
1EMBL-European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom

Tóm tắt

We have developed a new set of algorithms, collectively called “Velvet,” to manipulate de Bruijn graphs for genomic sequence assembly. A de Bruijn graph is a compact representation based on short words (k-mers) that is ideal for high coverage, very short read (25–50 bp) data sets. Applying Velvet to very short reads and paired-ends information only, one can produce contigs of significant length, up to 50-kb N50 length in simulations of prokaryotic data and 3-kb N50 on simulated mammalian BACs. When applied to real Solexa data sets without read pairs, Velvet generated contigs of ∼8 kb in a prokaryote and 2 kb in a mammalian BAC, in close agreement with our simulated results without read-pair information. Velvet represents a new approach to assembly that can leverage very short reads in combination with read pairs to produce useful assemblies.

Từ khóa


Tài liệu tham khảo

Batzoglou, S. (2005) in Encyclopedia of genomics, proteomics and bioinformatics, Algorithmic challenges in mammalian genome sequence assembly, ed Dunn, M. (John Wiley and Sons, New York) Part 4.

10.1101/gr.208902

10.1016/j.gde.2006.10.009

10.1093/bioinformatics/bti129

10.1093/bioinformatics/bth205

10.1101/gr.6435207

Gross, J.L. Yellen, J. (2004) Handbook of graph theory (CRC Press LLC, Boca Raton, FL).

10.1101/gr.2264004

10.1101/gr.1390403

Idury,, 1995, A new algorithm for DNA sequence assembly, J. Comput. Biol., 2, 291, 10.1089/cmb.1995.2.291

10.1038/35057062

10.1093/bioinformatics/btm451

10.1038/ng.2007.9

10.1126/science.1141319

10.1038/nmeth726

10.1016/0888-7543(88)90007-9

Margulies,, 2005, Genome sequencing in microfabricated high-density picolitre reactors, Nature, 437, 376, 10.1038/nature03959

10.1101/gr.3770505

10.1101/gr.731003

10.1093/bioinformatics/bti1114

10.1126/science.287.5461.2196

10.1073/pnas.171285098

Shah, M.K. Lee, H. Rogers, S.A. Touchman, J.W. (2004) Computational Systems Bioinformatics Conference, An exhaustive genome assembly algorithm using k-mers to indirectly perform n-squared comparisons in O(n) (IEEE, New York), pp 740–741.

10.1186/1471-2105-6-31

10.1371/journal.pone.0000484

10.1126/science.1058040

Warren,, 2007, Assembling millions of short DNA sequences using SSAKE, Bioinformatics, 4, 500, 10.1093/bioinformatics/btl629

10.1038/nature01262