Development of TBSPG Pipelines for Refining Unique Mapping and Repetitive Sequence Detection Using the Two Halves of Each Illumina Sequence Read

Heng Xiang1,2, Xiu-Qing Li2
1College of Animal Science and Technology, Southwest University, Beibei, China
2Potato Research Centre, Agriculture and Agri-Food Canada, Fredericton, Canada

Abstract

We developed six pipelines, collectively named TBSPG, for mapping Illumina sequence reads to reference genomes, refining unique mapping, and computing the number of mapped reads and the coverage. The pipelines offer the following options: multi-mapping or unique mapping; paired-end read files or a single-end read file as input; removal or retention of sequences shared between the nucleus and organelles; and mapping with full-length reads or with the two halves of each read, which refines the detection of unique and non-unique sequences. The TBSPG pipelines are built on, and named after, publicly available tools: Trimmomatic, the Burrows–Wheeler Aligner (BWA), SAMtools, Picard, and the Genome Analysis Toolkit (GATK). We wrote several Perl scripts to fill the gaps between these tools and connect them, recognize half-length reads, select uniquely mapped reads, and compute and output data in a Microsoft Excel-readable format for studying the read number and coverage per chromosome and per organellar genome. In a potato 100-bp paired-end sequence file (Illumina TruSeq), approximately 6.75% of the uniquely mapped full-length reads were found to contain non-unique sequences at the half-length-read level. The freely available TBSPG pipelines can be used for many read-based applications, including repetitive sequence analysis and organellar genome copy number estimation.
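The half-read refinement described above can be illustrated with a minimal Python sketch. It is not the authors' Perl code, which is not reproduced here. The `_H1`/`_H2` suffixes, the function names, and the rule that a full-length read stays "unique" only when each of its halves maps to exactly one location are all assumptions made for illustration. The per-half hit counts would come from a separate alignment step, which is not shown.

```python
from typing import Dict, Iterable, List, Tuple

# (read_id, sequence, quality string) -- a simplified FASTQ record
FastqRecord = Tuple[str, str, str]


def split_read(record: FastqRecord) -> List[FastqRecord]:
    """Split one read into two halves.

    The _H1/_H2 suffixes are a hypothetical naming scheme used so that
    the halves can be traced back to their parent read after mapping.
    """
    read_id, seq, qual = record
    mid = len(seq) // 2
    return [
        (f"{read_id}_H1", seq[:mid], qual[:mid]),
        (f"{read_id}_H2", seq[mid:], qual[mid:]),
    ]


def refine_unique(
    full_unique_ids: Iterable[str],
    half_hit_counts: Dict[str, int],
) -> List[str]:
    """Keep reads whose two halves each map to exactly one location.

    Assumption: a half with zero hits or more than one hit disqualifies
    the parent read. The published pipelines may treat unmapped halves
    differently.
    """
    kept = []
    for read_id in full_unique_ids:
        h1 = half_hit_counts.get(f"{read_id}_H1", 0)
        h2 = half_hit_counts.get(f"{read_id}_H2", 0)
        if h1 == 1 and h2 == 1:
            kept.append(read_id)
    return kept


if __name__ == "__main__":
    halves = split_read(("r1", "ACGTACGTAC", "IIIIIIIIII"))
    print(halves)
    # r2 maps uniquely at full length, but its second half has 3 hits.
    hits = {"r1_H1": 1, "r1_H2": 1, "r2_H1": 1, "r2_H2": 3}
    print(refine_unique(["r1", "r2"], hits))
```

In this sketch, a read such as `r2` corresponds to the reads the abstract describes as uniquely mapped at full length but containing a non-unique sequence at the half-length-read level.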

References

Li H, Durbin R (2009) Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25:1754–1760

Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R (2009) The sequence alignment/map format and SAMtools. Bioinformatics 25:2078–2079

Lohse M, Bolger AM, Nagel A, Fernie AR, Lunn JE, Stitt M, Usadel B (2012) RobiNA: a user-friendly, integrated software solution for RNA-Seq-based transcriptomics. Nucleic Acids Res 40:W622–W627

McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA (2010) The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res 20:1297–1303

Xu X, Pan S, Cheng S, Zhang B, Mu D, Ni P, Zhang G, Yang S, Li R, Wang J et al (2011) Genome sequence and analysis of the tuber crop potato. Nature 475:189–195