Fast comparison of genomic and meta-genomic reads with alignment-free measures based on quality values

BMC Medical Genomics - Volume 9 - Pages 41-50 - 2016
Matteo Comin1, Michele Schimd1
1Department of Information Engineering, University of Padova, Padova, Italy

Abstract

Sequencing technologies are generating enormous amounts of read data, yet the assembly of genomes and metagenomes remains among the most challenging tasks. In this paper we study the comparison of genomes and metagenomes based only on read data, using word-count statistics known as alignment-free measures, which do not require reference genomes or assemblies. Quality scores produced by sequencing platforms are fundamental for various analyses; moreover, future-generation sequencing platforms will produce longer reads but with an error rate of around 15%. In this context it will be fundamental to exploit quality-value information within the framework of alignment-free measures. We present a family of alignment-free measures, called d^q-type, that are based on k-mer counts and quality values. These statistics can be used to compare genomes and metagenomes based on their read sets. Results show that the evolutionary relationships among genomes can be reconstructed by directly comparing their read sets. On average, the use of quality values improves classification accuracy, and its contribution increases as the reads become noisier. The comparison of metagenomic microbial communities can also be performed efficiently: similar metagenomes are detected quickly, just by processing their read data, without the need for costly alignments.
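The idea of combining k-mer counts with quality values can be illustrated with a minimal Python sketch. It is not the d^q-type definition from the paper, which the abstract does not spell out. It assumes Phred+33 quality strings, weights each k-mer occurrence by the product of the probabilities that its bases were called correctly, and compares two read sets with a plain inner product of the weighted count vectors (a D2-style statistic). The function names and the weighting scheme are illustrative assumptions.

```python
from collections import defaultdict
from math import prod


def prob_correct(q_char, offset=33):
    """Probability that a base call is correct, from one quality character.

    Assumes Phred scaling, P(error) = 10^(-Q/10), with an ASCII offset of 33.
    """
    q = ord(q_char) - offset
    return 1.0 - 10.0 ** (-q / 10.0)


def weighted_kmer_counts(reads, k):
    """Quality-weighted k-mer counts for a set of reads.

    reads: iterable of (sequence, quality_string) pairs of equal length.
    Each occurrence of a k-mer adds the product of the per-base
    probabilities of correctness instead of adding 1 (assumed weighting).
    """
    counts = defaultdict(float)
    for seq, qual in reads:
        probs = [prob_correct(c) for c in qual]
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += prod(probs[i:i + k])
    return dict(counts)


def d2_style_similarity(counts_a, counts_b):
    """Inner product of two weighted count vectors over shared k-mers."""
    shared = counts_a.keys() & counts_b.keys()
    return sum(counts_a[w] * counts_b[w] for w in shared)


# Hypothetical toy read sets: (sequence, Phred+33 quality string).
sample_a = [("ACGTAC", "IIIIII"), ("CGTACG", "II5III")]
sample_b = [("ACGTTC", "IIII#I")]

counts_a = weighted_kmer_counts(sample_a, k=3)
counts_b = weighted_kmer_counts(sample_b, k=3)
similarity = d2_style_similarity(counts_a, counts_b)
```

With this weighting, a k-mer that overlaps a low-quality base contributes less to the count vector than one read at high quality, which is one plausible way for quality values to matter more as reads become noisier. A practical measure would also normalize or center the counts before comparison.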
