Exact p-value calculation for heterotypic clusters of regulatory motifs and its application in computational annotation of cis-regulatory modules

Valentina Boeva1,2, Julien Clément3, Mireille Régnier2, Mikhail A. Roytberg4,5, Vsevolod J. Makeev6,1
1Institute of Genetics and Selection of Industrial Microorganisms, GosNIIGenetika, Moscow, Russia
2MIGEC, INRIA Rocquencourt, Le Chesnay, France
3GREYC, CNRS UMR 6072, Laboratoire d'informatique, Caen, France
4Institute of Mathematical Problems of Biology, Russian Academy of Sciences, Puschino, Russia
5Puschino State University, Puschino, Russia
6Engelhardt Institute of Molecular Biology, Russian Academy of Sciences, Moscow, Russia
