Scaffolder - software for manual genome scaffolding
Abstract
The assembly of next-generation short-read sequencing data can result in a fragmented, non-contiguous set of genomic sequences. A common step in a genome project is therefore to join neighbouring sequence regions together and fill gaps. This scaffolding step is non-trivial and requires manually editing large blocks of nucleotide sequence. Joining these sequences together also hides the source of each region in the final genome sequence. Taken together, these considerations may make reproducing or editing an existing genome scaffold difficult.
The software outlined here, "Scaffolder," is implemented in the Ruby programming language and can be installed via the RubyGems software management system. Genome scaffolds are defined using YAML, a data format that is both human- and machine-readable. Command line binaries and extensive documentation are available.
This software allows a genome build to be defined in terms of its constituent sequences using a relatively simple syntax. The syntax also allows unknown regions to be specified and additional sequence to be used to fill known gaps in the scaffold. Defining the genome construction in a file makes the scaffolding process reproducible and easier to edit than large FASTA nucleotide sequences. Scaffolder is easy-to-use genome scaffolding software that promotes reproducibility and continuous development in a genome project. Scaffolder can be found at http://next.gs.
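To make the idea of a file-defined scaffold concrete, the following is a minimal Ruby sketch, not Scaffolder's own implementation. It assumes a simplified YAML layout in which each entry is either a `sequence` (naming a contig, optionally reversed) or an `unresolved` region of a given length; these key names, the `build_scaffold` helper, and the in-memory `CONTIGS` hash standing in for a FASTA file are all illustrative assumptions. Filling gaps with insert sequences is omitted for brevity.

```ruby
require 'yaml'

# Hypothetical scaffold definition. The key names are illustrative
# assumptions, not a confirmed copy of Scaffolder's schema.
SCAFFOLD_YAML = <<~YAML
  - sequence:
      source: contig1
  - unresolved:
      length: 5
  - sequence:
      source: contig2
      reverse: true
YAML

# Stand-in for a FASTA file: sequence name => nucleotides.
CONTIGS = {
  'contig1' => 'ATGC',
  'contig2' => 'AAGT'
}.freeze

# Reverse the sequence and swap each base for its complement.
def reverse_complement(seq)
  seq.reverse.tr('ATGCatgc', 'TACGtacg')
end

# Walk the scaffold entries in order and concatenate the result.
# Unresolved regions are written as runs of 'N'.
def build_scaffold(yaml_text, contigs)
  YAML.safe_load(yaml_text).map do |entry|
    type, attrs = entry.first
    case type
    when 'sequence'
      seq = contigs.fetch(attrs['source'])
      attrs['reverse'] ? reverse_complement(seq) : seq
    when 'unresolved'
      'N' * attrs['length']
    else
      raise ArgumentError, "unknown entry type: #{type}"
    end
  end.join
end

puts build_scaffold(SCAFFOLD_YAML, CONTIGS)
```

Because the scaffold lives in a short text file rather than in one large FASTA sequence, reordering contigs or changing a gap length is a one-line edit, and the origin of every region stays visible.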