Computing distribution of scale independent motifs in biological sequences

Springer Science and Business Media LLC - Volume 1 - Pages 1-11 - 2006
Jonas S Almeida1, Susana Vinga2,3
1Dept Biostatistics and Applied Mathematics, Univ. Texas MDAnderson Cancer Center, Houston, USA
2Instituto de Engenharia de Sistemas e Computadores: Investigação e Desenvolvimento (INESC-ID), Lisboa, Portugal
3Departamento de Bioestatística e Informática, Faculdade de Ciências Médicas – Universidade Nova de Lisboa (FCM/UNL), Lisboa, Portugal

Abstract

The use of Chaos Game Representation (CGR) or its generalization, Universal Sequence Maps (USM), to describe the distribution of biological sequences has been found objectionable because of the fractal structure of that coordinate system. Consequently, the investigation of the distribution of symbolic motifs at multiple scales is hampered by an inexact association between distance and sequence dissimilarity. A solution to this problem could unleash the use of iterative maps as phase-state representations of sequences, in which their statistical properties can be conveniently investigated. In this study, a family of kernel density functions is described that accommodates the fractal nature of iterative function representations of symbolic sequences and, consequently, enables the exact investigation of sequence motifs of arbitrary length in that scale-independent representation. Furthermore, the proposed kernel density includes both Markovian succession and currently used alignment-free sequence dissimilarity metrics as special solutions. The fractal kernel described is therefore a generalization that provides a common framework for a diverse suite of sequence analysis techniques.
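To make the iterative map concrete, the following is a minimal Python sketch of the standard CGR midpoint iteration for DNA, together with a count of length-L motifs obtained by binning points into sub-squares of side 2^-L. It is not the kernel density proposed in the paper, whose form is not given in this abstract. The corner assignment, the starting point (0.5, 0.5), and the function names `cgr_points`, `cell_index`, and `motif_cell_counts` are assumptions made for illustration only.

```python
from collections import Counter

# Assumed corner assignment for the unit square (one common convention;
# other assignments are equally valid).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq, start=(0.5, 0.5)):
    """Return the CGR coordinates of every prefix of seq.

    Each step moves the current point halfway toward the corner
    assigned to the next symbol.
    """
    x, y = start
    points = []
    for symbol in seq:
        cx, cy = CORNERS[symbol]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points


def cell_index(point, length):
    """Map a point to its sub-square on a 2**length by 2**length grid.

    Points whose sequences share the same last `length` symbols fall
    in the same sub-square.
    """
    n = 2 ** length
    x, y = point
    # Clamp guards against floating-point rounding up to 1.0 on long runs.
    return (min(int(x * n), n - 1), min(int(y * n), n - 1))


def motif_cell_counts(seq, length):
    """Count occurrences of each length-`length` motif via its sub-square."""
    points = cgr_points(seq)
    # The first length-1 points do not yet encode a full motif.
    return Counter(cell_index(p, length) for p in points[length - 1:])


if __name__ == "__main__":
    print(cgr_points("AG"))
    print(motif_cell_counts("AGAG", 2))
```

Under the assumed corners, the points for `CA` and `AC` lie 0.25 apart, exactly as far apart as the points for `AA` and `CA`, although only the second pair shares a final symbol. This is the inexact association between distance and sequence dissimilarity that the abstract refers to: membership in a common sub-square reflects a shared suffix, whereas plain Euclidean distance does not.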
