BLAT—The BLAST-Like Alignment Tool

Genome Research - Volume 12, Issue 4 - Pages 656-664 - 2002
W. James Kent
Department of Biology and Center for Molecular Biology of RNA, University of California-Santa Cruz, Santa Cruz, CA 95064, USA.

Abstract

Analyzing vertebrate genomes requires rapid mRNA/DNA and cross-species protein alignments. A new tool, BLAT, is more accurate and 500 times faster than popular existing tools for mRNA/DNA alignments and 50 times faster for protein alignments at sensitivity settings typically used when comparing vertebrate sequences. BLAT's speed stems from an index of all nonoverlapping K-mers in the genome. This index fits inside the RAM of inexpensive computers, and need only be computed once for each genome assembly. BLAT has several major stages. It uses the index to find regions in the genome likely to be homologous to the query sequence. It performs an alignment between homologous regions. It stitches together these aligned regions (often exons) into larger alignments (typically genes). Finally, BLAT revisits small internal exons possibly missed at the first stage and adjusts large gap boundaries that have canonical splice sites where feasible. This paper describes how BLAT was optimized. Effects on speed and sensitivity are explored for various K-mer sizes, mismatch schemes, and number of required index matches. BLAT is compared with other alignment programs on various test sets and then used in several genome-wide applications. http://genome.ucsc.edu hosts a web-based BLAT server for the human genome.
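The first stage described above (index the nonoverlapping K-mers of the genome, then look up query K-mers to find likely homologous regions) can be illustrated with a small Python sketch. This is not BLAT's actual implementation: the function names `build_index`, `find_hits`, and `candidate_diagonals` are hypothetical, only perfect K-mer matches are considered, and grouping hits by diagonal with a `min_hits` threshold is an assumed simplification of "number of required index matches".

```python
from collections import defaultdict


def build_index(genome, k):
    """Map each nonoverlapping K-mer of the genome to its start positions."""
    index = defaultdict(list)
    # Step by k so that consecutive indexed K-mers do not overlap.
    for start in range(0, len(genome) - k + 1, k):
        index[genome[start:start + k]].append(start)
    return dict(index)


def find_hits(index, query, k):
    """Look up every overlapping K-mer of the query in the index.

    Returns a list of (query_pos, genome_pos) pairs, perfect matches only.
    """
    hits = []
    for q in range(len(query) - k + 1):
        for g in index.get(query[q:q + k], ()):
            hits.append((q, g))
    return hits


def candidate_diagonals(hits, min_hits):
    """Count hits per diagonal (genome_pos - query_pos).

    Keeps only diagonals with at least min_hits hits (assumed criterion).
    """
    counts = defaultdict(int)
    for q, g in hits:
        counts[g - q] += 1
    return {d: n for d, n in counts.items() if n >= min_hits}


# Toy example: the query is a substring of the genome starting at offset 3.
genome = "ACGTACGGTTCAGGCTTAAC"
query = genome[3:15]
index = build_index(genome, 4)
hits = find_hits(index, query, 4)
regions = candidate_diagonals(hits, 2)
```

In this toy case two indexed K-mers fall entirely inside the query, both on the diagonal that corresponds to offset 3, so that diagonal passes the two-hit threshold. Because the genome is indexed in steps of K while the query is scanned at every position, the index holds roughly 1/K as many entries as an overlapping index would.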
