SOAPnuke: a MapReduce acceleration-supported software for integrated quality control and preprocessing of high-throughput sequencing data

Oxford University Press (OUP) - Volume 7, Issue 1 - 2018
Yuxin Chen1, Yongsheng Chen2, Chunmei Shi3,4,5, Zhibo Huang1, Yong Zhang1,6, Shengkang Li1,6, Yan Li1, Jia Ye1, Chang Ho Yu7, Zhuo Li8,9, Xiuqing Zhang1, Wei Wang1,10, Huanming Yang1,10, Lin Fang1,6, Qiang Chen3,4,5
1BGI-Shenzhen, Shenzhen 518083
2Geneplus-Beijing, Beijing 102206
3Department of Oncology, Fujian Medical University Union Hospital, Fuzhou 350001
4Department of Stem Cell Research Institute, Fujian Medical University Stem Cell Research Institute, Fuzhou 350000
5Fujian Key Laboratory of Translational Cancer Medicine, Fuzhou 350014
6Collaborative Innovation Center of High Performance Computing, National University of Defense Technology, Changsha 410073
7Intel China Ltd., Shanghai 200336
8Department of Surgery, Faculty of Medicine, The Chinese University of Hong Kong, Hong Kong
9Guangdong Provincial Hospital of Chinese Medicine, Guangzhou 510120
10James D. Watson Institute of Genome Sciences, Hangzhou 310058, China

