Identifying Statistical Dependence in Genomic Sequences via Mutual Information Estimates
Abstract
Questions of understanding and quantifying the representation and amount of information in organisms have become a central part of biological research, as they potentially hold the key to fundamental advances. In this paper, we demonstrate the use of information-theoretic tools for the task of identifying segments of biomolecules (DNA or RNA) that are statistically correlated. We develop a precise and reliable methodology, based on the notion of mutual information, for finding and extracting statistical as well as structural dependencies. A simple threshold function is defined, and its use in quantifying the level of significance of dependencies between biological segments is explored. These tools are used in two specific applications. First, they are used to identify correlations between different parts of the maize zmSRp32 gene. There, we find significant dependencies between the untranslated region in zmSRp32 and its alternatively spliced exons. This observation may indicate the presence of as-yet unknown alternative splicing mechanisms or structural scaffolds. Second, using data from the FBI's Combined DNA Index System (CODIS), we demonstrate that our approach is particularly well suited to the problem of discovering short tandem repeats, an application of importance in genetic profiling.
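To make the idea of a mutual information estimate with a significance threshold concrete, here is a minimal Python sketch. It is not the paper's implementation. It assumes the two segments have equal length and are paired position by position, it uses the plug-in (empirical) estimator, and it swaps in a generic permutation-based threshold because the abstract does not specify the threshold function. The names `mutual_information` and `permutation_threshold` are hypothetical.

```python
import math
import random
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in (empirical) mutual information, in bits, between two
    equal-length symbol sequences paired position by position."""
    if len(xs) != len(ys):
        raise ValueError("sequences must have the same length")
    n = len(xs)
    if n == 0:
        return 0.0
    joint = Counter(zip(xs, ys))  # empirical joint counts
    px = Counter(xs)              # empirical marginal counts of xs
    py = Counter(ys)              # empirical marginal counts of ys
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log2( p(x,y) / (p(x) p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


def permutation_threshold(xs, ys, n_perm=200, quantile=0.95, seed=0):
    """Assumed stand-in for a significance threshold: the given quantile
    of the estimate after shuffling ys, which destroys any positional
    dependence while keeping the symbol frequencies."""
    rng = random.Random(seed)
    shuffled = list(ys)
    null = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        null.append(mutual_information(xs, shuffled))
    null.sort()
    idx = min(len(null) - 1, int(quantile * len(null)))
    return null[idx]


if __name__ == "__main__":
    a = "ACGT" * 10
    b = "ACGT" * 10  # identical to a, so maximally dependent
    estimate = mutual_information(a, b)
    threshold = permutation_threshold(a, b)
    print("dependent" if estimate > threshold else "not significant")
```

Under these assumptions, a pair of segments would be flagged as dependent when its estimate exceeds the threshold obtained from the shuffled copies. The permutation count and quantile are illustrative defaults, not values taken from the paper.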