transAlign: using amino acids to facilitate the multiple alignment of protein-coding DNA sequences

BMC Bioinformatics - Tập 6 - Trang 1-6 - 2005
Olaf RP Bininda-Emonds1
1Lehrstuhl für Tierzucht, Technical University of Munich, Freising-Weihenstephan, Germany

Tóm tắt

Alignments of homologous DNA sequences are crucial for comparative genomics and phylogenetic analysis. However, multiple alignment represents a computationally difficult problem. For protein-coding DNA sequences, it is more advantageous in terms of both speed and accuracy to align the amino-acid sequences specified by the DNA sequences rather than the DNA sequences themselves. Many implementations making use of this concept of "translated alignments" are incomplete in the sense that they require the user to manually translate the DNA sequences and to perform the amino-acid alignment. As such, they are not well suited to large-scale automated alignments of large and/or numerous DNA data sets. transAlign is an open-source Perl script that aligns protein-coding DNA sequences via their amino-acid translations to take advantage of the superior multiple-alignment capabilities and speed of an amino-acid alignment. It operates by translating each DNA sequence into its corresponding amino-acid sequence, passing the entire matrix to ClustalW for alignment, and then back-translating the resulting amino-acid alignment to derive the aligned DNA sequences. In the translation step, transAlign determines the optimal orientation and reading frame for each DNA sequence according to the desired genetic code. It also checks for apparent frame shifts in the DNA sequences and can handle frame-shifted sequences in one of three ways (delete, align as amino acids regardless, or profile align as DNA). As a set of comparative benchmarks derived from six protein-coding genes for mammals shows, the strategy implemented in transAlign always improves the speed and usually the apparent accuracy of the alignment of protein-coding DNA sequences. transAlign represents one of few full and cross-platform implementations of the concept of translated alignments. Both the advantages accruing from performing a translated alignment and the suite of user-definable options available in the program mean that transAlign is ideally suited for large-scale automated alignments of very large and/or very numerous protein-coding DNA data sets. However, the good performance offered by the program also translates to the alignment of any set of protein-coding sequences. transAlign, including the source code, is freely available at http://www.tierzucht.tum.de/Bininda-Emonds/ (under "Programs").

Tài liệu tham khảo

Haubold B, Wiehe T: Comparative genomics: methods and applications. Naturwissenschaften 2004, 91: 405–421. Wernersson R, Pedersen AG: RevTrans: Multiple alignment of coding DNA from aligned amino acid sequences. Nucleic Acids Res 2003, 31: 3537–3539. 10.1093/nar/gkg609 Henikoff S, Henikoff JG: Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci USA 1992, 89: 10915–10919. Gonnet GH, Cohen MA, Benner SA: Exhaustive matching of the entire protein sequence database. Science 1992, 256: 1443–1445. Dayhoff MO, Schwartz RM, Orcutt BC: A model of evolutionary change in proteins. In Atlas of Protein Sequence Structure. Volume 5. Edited by: Dayhoff MO. Washington, D.C.: National Biomedical Research Foundation; 1978:345–352. Posada D, Crandall KA: Selecting the best-fit model of nucleotide substitution. Syst Biol 2001, 50: 580–601. 10.1080/106351501750435121 Smith TF, Waterman MS: Identification of common molecular subsequences. J Mol Biol 1981, 147: 195–197. 10.1016/0022-2836(81)90087-5 MRTRANS – CDNA alignment based on protein alignment[http://www.rfcgr.mrc.ac.uk/Registered/Option/mrtrans.html] RevTrans Server[http://www.cbs.dtu.dk/services/RevTrans/] Morgenstern B: DIALIGN 2: improvement of the segment-to-segment approach to multiple sequence alignment. Bioinformatics 1999, 15: 211–218. 10.1093/bioinformatics/15.3.211 Brudno M, Do CB, Cooper GM, Kim MF, Davydov E, Green ED, Sidow A, Batzoglou S: LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA. Genome Res 2003, 13: 721–731. 10.1101/gr.926603 Maddison DR, Swofford DL, Maddison WP: NEXUS: an extensible file format for systematic information. Syst Biol 1997, 46: 590–621. Felsenstein J: PHYLIP (Phylogeny Inference Package), version 3.6. Seattle: Department of Genome Sciences, University of Washington; 2004. Guindon S, Gascuel O: A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol 2003, 52: 696–704. 10.1080/10635150390235520 Se-Al Homepage[http://evolve.zoo.ox.ac.uk/software.html?name=Se-Al] Readseq Homepage[http://iubio.bio.indiana.edu/soft/molbio/readseq/java/] HMMER: sequence analysis using profile hidden Markov models[http://hmmer.wustl.edu/] NCBI Taxonomy Homepage[http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi] Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 1994, 22: 4673–4680. Chenna R, Sugawara H, Koike T, Lopez R, Gibson TJ, Higgins DG, Thompson JD: Multiple sequence alignment with the Clustal series of programs. Nucleic Acids Res 2003, 31: 3497–3500. 10.1093/nar/gkg500 Wain HM, Lush M, Ducluzeau F, Povey S: Genew: the human gene nomenclature database. Nucleic Acids Res 2002, 30: 169–171. 10.1093/nar/30.1.169