Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation

Genome Research, Volume 27, Issue 5, Pages 722-736, 2017
Sergey Koren¹, Brian Walenz¹, Konstantin Berlin², Jason Miller³, Nicholas H. Bergman⁴, Adam M. Phillippy¹
¹Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA
²Invincea Incorporated, Fairfax, Virginia 22030, USA
³J. Craig Venter Institute, Rockville, Maryland 20850, USA
⁴National Biodefense Analysis and Countermeasures Center, Frederick, Maryland 21702, USA

Abstract

Long-read single-molecule sequencing has revolutionized de novo genome assembly and enabled the automated reconstruction of reference-quality genomes. However, given the relatively high error rates of such technologies, efficient and accurate assembly of large repeats and closely related haplotypes remains challenging. We address these issues with Canu, a successor of Celera Assembler that is specifically designed for noisy single-molecule sequences. Canu introduces support for nanopore sequencing, halves depth-of-coverage requirements, and improves assembly continuity while simultaneously reducing runtime by an order of magnitude on large genomes versus Celera Assembler 8.2. These advances result from new overlapping and assembly algorithms, including an adaptive overlapping strategy based on tf-idf weighted MinHash and a sparse assembly graph construction that avoids collapsing diverged repeats and haplotypes. We demonstrate that Canu can reliably assemble complete microbial genomes and near-complete eukaryotic chromosomes using either Pacific Biosciences (PacBio) or Oxford Nanopore technologies and achieves a contig NG50 of >21 Mbp on both human and Drosophila melanogaster PacBio data sets. For assembly structures that cannot be linearly represented, Canu provides graph-based assembly outputs in graphical fragment assembly (GFA) format for analysis or integration with complementary phasing and scaffolding techniques. The combination of such highly resolved assembly graphs with long-range scaffolding information promises the complete and automated assembly of complex genomes.
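
The abstract names tf-idf weighted MinHash as the basis of the overlapping strategy without giving details. The sketch below is only an illustration of the general idea and is not taken from Canu's source code. It assumes one common way to build a weighted MinHash: each k-mer is hashed once per unit of an integer weight, so higher-weight k-mers are more likely to supply the minimum. The weight formula, the function names (`kmers`, `idf_table`, `weighted_sketch`, `estimate_similarity`), and all parameter values are hypothetical.

```python
import hashlib
import math
from collections import Counter


def kmers(seq, k):
    """Return all overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def idf_table(reads, k):
    """Assumed idf: log(1 + N / df), where df counts reads containing the k-mer."""
    doc_freq = Counter()
    for read in reads:
        doc_freq.update(set(kmers(read, k)))
    n = len(reads)
    return {kmer: math.log(1 + n / df) for kmer, df in doc_freq.items()}


def hash64(kmer, seed, copy):
    """Deterministic 64-bit hash of a (seed, copy, k-mer) triple."""
    data = f"{seed}:{copy}:{kmer}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def weighted_sketch(seq, k, idf, num_hashes=128):
    """Weighted MinHash sketch: one minimum per seed.

    Each k-mer gets an integer weight of about tf * idf (at least 1) and is
    hashed that many times, so rare k-mers win the minimum more often than
    k-mers that occur in most reads.
    """
    tf = Counter(kmers(seq, k))
    if not tf:
        raise ValueError("sequence is shorter than k")
    sketch = []
    for seed in range(num_hashes):
        best = None
        for kmer, count in tf.items():
            weight = max(1, round(count * idf.get(kmer, 1.0)))
            for copy in range(weight):
                h = hash64(kmer, seed, copy)
                if best is None or h < best:
                    best = h
        sketch.append(best)
    return sketch


def estimate_similarity(sketch_a, sketch_b):
    """Fraction of sketch positions where the two minima agree."""
    matches = sum(1 for a, b in zip(sketch_a, sketch_b) if a == b)
    return matches / len(sketch_a)


if __name__ == "__main__":
    reads = ["ACGTACGTGGA", "ACGTACGTGGC", "TTTTTTTTTTT"]
    idf = idf_table(reads, k=4)
    sketches = [weighted_sketch(r, 4, idf) for r in reads]
    print(estimate_similarity(sketches[0], sketches[1]))
    print(estimate_similarity(sketches[0], sketches[2]))
```

With these assumptions, the fraction of matching sketch positions estimates a weighted Jaccard similarity between two reads, and read pairs that score high would be the candidates passed on to a more careful overlap check. A production implementation would avoid rehashing every k-mer for every seed.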
