STAR: ultrafast universal RNA-seq aligner
Abstract
Motivation: Accurate alignment of high-throughput RNA-seq data is a challenging and yet unsolved problem because of the non-contiguous transcript structure, relatively short read lengths and constantly increasing throughput of the sequencing technologies. Currently available RNA-seq aligners suffer from high mapping error rates, low mapping speed, read length limitation and mapping biases.
Results: To align our large (>80 billion reads) ENCODE Transcriptome RNA-seq dataset, we developed the Spliced Transcripts Alignment to a Reference (STAR) software based on a previously undescribed RNA-seq alignment algorithm that uses sequential maximum mappable seed search in uncompressed suffix arrays followed by a seed clustering and stitching procedure. STAR outperforms other aligners by a factor of >50 in mapping speed, aligning to the human genome 550 million 2 × 76 bp paired-end reads per hour on a modest 12-core server, while at the same time improving alignment sensitivity and precision. In addition to unbiased de novo detection of canonical junctions, STAR can discover non-canonical splices and chimeric (fusion) transcripts, and is also capable of mapping full-length RNA sequences. Using Roche 454 sequencing of reverse transcription polymerase chain reaction amplicons, we experimentally validated 1960 novel intergenic splice junctions with an 80–90% success rate, corroborating the high precision of the STAR mapping strategy.
Availability and implementation: STAR is implemented as standalone C++ code. STAR is free, open-source software distributed under the GPLv3 license and can be downloaded from http://code.google.com/p/rna-star/.
Contact: [email protected].
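As an illustration of the "sequential maximum mappable seed search in uncompressed suffix arrays" named in the Results, the following is a minimal sketch and not STAR's actual implementation. It assumes a toy genome held in memory, a naive suffix array construction, exact matching on one strand only, and a simplified rule of skipping one base when nothing maps. The function names (`build_suffix_array`, `longest_mappable_prefix`, `sequential_seeds`) are hypothetical, and the clustering, stitching and scoring steps are not shown.

```python
def build_suffix_array(text):
    # Naive construction (sorts full suffixes); acceptable only for toy inputs.
    return sorted(range(len(text)), key=lambda i: text[i:])


def common_prefix_len(a, b):
    # Number of leading characters shared by a and b.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def longest_mappable_prefix(text, sa, query):
    """Return (length, positions) for the longest prefix of `query` found in `text`."""
    # Binary search for the first suffix that is >= query.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:] < query:
            lo = mid + 1
        else:
            hi = mid
    # In sorted order, the best match is a neighbour of the insertion point.
    best = 0
    for k in (lo - 1, lo):
        if 0 <= k < len(sa):
            best = max(best, common_prefix_len(text[sa[k]:], query))
    if best == 0:
        return 0, []
    # Linear scan for all occurrences, kept simple for clarity; a real
    # implementation would read them from a contiguous suffix array range.
    prefix = query[:best]
    positions = sorted(p for p in sa if text.startswith(prefix, p))
    return best, positions


def sequential_seeds(text, sa, read):
    """Greedily split `read` into seeds: (read_offset, length, genome_positions)."""
    seeds = []
    start = 0
    while start < len(read):
        length, positions = longest_mappable_prefix(text, sa, read[start:])
        if length == 0:
            start += 1  # assumption: skip a base that cannot be mapped at all
            continue
        seeds.append((start, length, positions))
        start += length  # restart the search from the first unmapped base
    return seeds
```

For a read that spans two exons, the first search stops where the read diverges from the genome (the donor site) and the second search restarts from the unmapped remainder, so the gap between the two seeds' genome positions suggests a candidate intron without any prior junction annotation.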