Computing expectation values for RNA motifs using discrete convolutions

André Lambert1, Matthieu Legendre2, Jean-Fred Fontaine2, Daniel Gautheret2
1CNRS UMR 6207, Université de la Méditerranée, Luminy Case 907, 13288, Marseille, Cedex 9, France
2INSERM ERM 206, Université de la Méditerranée, Luminy Case 928, 13288, Marseille, Cedex 9, France

Abstract

Background
Computational biologists use expectation values (E-values) to estimate the number of solutions that can be expected by chance during a database scan. Here we focus on computing E-values for RNA motifs defined by single-strand and helix lod-score profiles with variable helix spans. Such E-values cannot be computed assuming a normal score distribution, and their estimation previously required lengthy simulations.

Results
We introduce discrete convolutions as an accurate and fast means to estimate score distributions of lod-score profiles. This method provides excellent score estimations for all single-strand or helical elements tested and also applies to the combination of elements into larger, complex motifs. Further, the estimated distributions remain accurate even when pseudocounts are introduced into the lod-score profiles. Estimated score distributions are then easily converted into E-values.

Conclusion
Good agreement was observed between computed E-values and simulations for a number of complete RNA motifs. This method is now implemented in the ERPIN software, but it can equally be applied to any search procedure based on ungapped profiles with statistically independent columns.
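The convolution idea summarized in the abstract can be illustrated with a minimal sketch. It is not the ERPIN implementation: the function names, the score binning step, and the toy profile are all assumptions made for illustration. The sketch relies only on what the abstract states, namely that the profile is ungapped and its columns are statistically independent, so the distribution of the total score is the discrete convolution of the per-column score distributions. A helix column could be handled the same way by keying the column on base-pair symbols rather than single bases.

```python
from collections import defaultdict


def column_distribution(column_scores, background, step=0.1):
    """Score distribution of one profile column under the background model.

    column_scores maps each symbol to its lod score; background maps each
    symbol to its probability. Scores are binned to integer multiples of
    `step` (the bin width is an assumption of this sketch).
    """
    dist = defaultdict(float)
    for symbol, score in column_scores.items():
        dist[round(score / step)] += background[symbol]
    return dict(dist)


def convolve(dist_a, dist_b):
    """Discrete convolution: distribution of the sum of two independent scores."""
    out = defaultdict(float)
    for score_a, prob_a in dist_a.items():
        for score_b, prob_b in dist_b.items():
            out[score_a + score_b] += prob_a * prob_b
    return dict(out)


def profile_distribution(profile, background, step=0.1):
    """Distribution of the total score of an ungapped profile."""
    total = {0: 1.0}  # an empty profile scores 0 with probability 1
    for column in profile:
        total = convolve(total, column_distribution(column, background, step))
    return total


def e_value(dist, cutoff, n_positions, step=0.1):
    """Expected number of chance hits scoring at least `cutoff`.

    n_positions is the number of positions scanned in the database
    (a hypothetical parameter name for this sketch).
    """
    threshold = round(cutoff / step)
    tail = sum(prob for score, prob in dist.items() if score >= threshold)
    return tail * n_positions


# Toy two-column profile with a uniform background (illustrative values only).
background = {"A": 0.25, "C": 0.25, "G": 0.25, "U": 0.25}
column = {"A": 1.0, "C": -1.0, "G": -1.0, "U": -1.0}
profile = [column, column]

dist = profile_distribution(profile, background)
evalue = e_value(dist, cutoff=2.0, n_positions=1600)
print(sorted(dist.items()))
print(evalue)
```

In this toy case only the sequence "AA" reaches the cutoff of 2.0, with probability 1/16, so scanning 1600 positions gives an E-value of about 100. Dictionaries keyed by binned score keep the sketch dependency-free; an array-based convolution would be the natural choice for long profiles.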
