Gigwa—Genotype investigator for genome-wide analyses
Abstract
Exploring the structure of genomes and analyzing their evolution is essential to understanding the ecological adaptation of organisms. However, the large volumes of data produced by next-generation sequencing raise computational challenges in storage, search, sharing, analysis and visualization. This is particularly true of studies of genomic variation, which currently lack scalable and user-friendly data exploration solutions. Here we present Gigwa, a web-based tool that provides an easy and intuitive way to explore large amounts of genotyping data by filtering it not only on variant features, including functional annotations, but also on genotype patterns. Data storage relies on MongoDB, which offers good scalability. Gigwa can handle multiple databases and may be deployed in either single- or multi-user mode. In addition, it provides a wide range of popular export formats. The Gigwa application is thus suitable for managing large amounts of genomic variation data, and its user-friendly web interface makes such processing widely accessible. It can either be deployed simply on a workstation or be used to provide a shared data portal for a given community of researchers.
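To make the two kinds of filtering mentioned above more concrete, here is a minimal Python sketch that combines a variant-feature filter (type, functional annotation) with a genotype-pattern filter over selected individuals. It is an illustration only: the record layout, the field names, the `filter_variants` helper and the pattern names `all_same` and `not_all_same` are assumptions made for this example, not Gigwa's actual MongoDB schema or API.

```python
from typing import Iterable, Optional

# Hypothetical in-memory records standing in for stored variants.
# Field names are illustrative; they are not Gigwa's real schema.
VARIANTS = [
    {"id": "v1", "type": "SNP", "chrom": "chr1", "pos": 1200,
     "effect": "missense_variant",
     "genotypes": {"s1": "A/A", "s2": "A/A", "s3": "A/G"}},
    {"id": "v2", "type": "SNP", "chrom": "chr1", "pos": 5400,
     "effect": "synonymous_variant",
     "genotypes": {"s1": "C/C", "s2": "C/C", "s3": "C/C"}},
    {"id": "v3", "type": "INDEL", "chrom": "chr2", "pos": 800,
     "effect": "missense_variant",
     "genotypes": {"s1": "T/TA", "s2": "T/T", "s3": "T/T"}},
]


def matches_pattern(genotypes: dict, samples: Iterable[str], pattern: str) -> bool:
    """Check a genotype pattern over the selected samples (assumed pattern names)."""
    if pattern == "any":
        return True
    # Distinct genotype calls among the selected samples that have a call.
    calls = {genotypes[s] for s in samples if s in genotypes}
    if pattern == "all_same":
        return len(calls) == 1
    if pattern == "not_all_same":
        return len(calls) > 1
    raise ValueError(f"unknown pattern: {pattern}")


def filter_variants(variants: Iterable[dict],
                    variant_type: Optional[str] = None,
                    effect: Optional[str] = None,
                    samples: Iterable[str] = (),
                    pattern: str = "any") -> list:
    """Return ids of variants passing both feature and genotype-pattern filters."""
    samples = list(samples)
    selected = []
    for v in variants:
        # Variant-feature filters: skipped when the criterion is None.
        if variant_type is not None and v["type"] != variant_type:
            continue
        if effect is not None and v["effect"] != effect:
            continue
        # Genotype-pattern filter over the chosen individuals.
        if not matches_pattern(v["genotypes"], samples, pattern):
            continue
        selected.append(v["id"])
    return selected


if __name__ == "__main__":
    # SNPs for which individuals s1 and s2 carry the same genotype.
    print(filter_variants(VARIANTS, variant_type="SNP",
                          samples=["s1", "s2"], pattern="all_same"))
```

In a database-backed deployment, the feature criteria would typically be pushed into the query itself so that only matching documents are scanned; the in-memory loop here only shows how the two filter types compose.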