Scalability and Validation of Big Data Bioinformatics Software

Computational and Structural Biotechnology Journal - Volume 15 - Pages 379-386 - 2017
Andrian Yang (1,2), Michael Troup (1), Joshua W.K. Ho (1,2)
(1) Victor Chang Cardiac Research Institute, Darlinghurst, NSW 2010, Australia
(2) St Vincent’s Clinical School, University of New South Wales, Darlinghurst, NSW 2010, Australia
