Identifying Statistical Dependence in Genomic Sequences via Mutual Information Estimates

Springer Science and Business Media LLC - Volume 2007, Issue 1 - Pages 1-11 - 2007
Hasan Metin Aktulga (1), Ioannis Kontoyiannis (2), L. Alex Lyznik (3), Lukasz Szpankowski (4), Ananth Y. Grama (1), Wojciech Szpankowski (1)
(1) Department of Computer Science, Purdue University, West Lafayette, USA
(2) Department of Informatics, Athens University of Economics & Business, Athens, Greece
(3) Pioneer Hi-Bred International, Johnston, USA
(4) Bioinformatics Program, University of California, San Diego, USA

Abstract

Questions of understanding and quantifying the representation and amount of information in organisms have become a central part of biological research, as they potentially hold the key to fundamental advances. In this paper, we demonstrate the use of information-theoretic tools for the task of identifying segments of biomolecules (DNA or RNA) that are statistically correlated. We develop a precise and reliable methodology, based on the notion of mutual information, for finding and extracting statistical as well as structural dependencies. A simple threshold function is defined, and its use in quantifying the level of significance of dependencies between biological segments is explored. These tools are used in two specific applications. First, they are used for the identification of correlations between different parts of the maize zmSRp32 gene. There, we find significant dependencies between the untranslated region in zmSRp32 and its alternatively spliced exons. This observation may indicate the presence of as-yet unknown alternative splicing mechanisms or structural scaffolds. Second, using data from the FBI's combined DNA index system (CODIS), we demonstrate that our approach is particularly well suited for the problem of discovering short tandem repeats—an application of importance in genetic profiling.
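
As a rough illustration of the quantity the abstract builds on, the following minimal sketch computes a plug-in (empirical frequency) estimate of the mutual information between two aligned symbol sequences of equal length. The function name `empirical_mutual_information` and the choice of a plain plug-in estimator are assumptions made for this example; the paper's own estimator, its sliding of one segment along another, and its threshold function are not reproduced here.

```python
from collections import Counter
from math import log2


def empirical_mutual_information(x, y):
    """Plug-in estimate of I(X;Y) in bits from paired symbols x[i], y[i].

    Hypothetical helper for illustration only: x and y are equal-length
    sequences (for example, strings over the alphabet A, C, G, T).
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(x)
    joint = Counter(zip(x, y))  # counts of symbol pairs (a, b)
    count_x = Counter(x)        # marginal counts for x
    count_y = Counter(y)        # marginal counts for y
    mi = 0.0
    for (a, b), c in joint.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts
        mi += (c / n) * log2(c * n / (count_x[a] * count_y[b]))
    return mi


if __name__ == "__main__":
    # A sequence paired with itself: the estimate equals its empirical entropy.
    print(empirical_mutual_information("ACGT" * 5, "ACGT" * 5))
    # Pairs whose joint counts factor exactly: the estimate is zero.
    print(empirical_mutual_information("AACC", "ACAC"))
```

On short segments the plug-in estimate is biased upward, so a raw value above zero is not by itself evidence of dependence. This is presumably why the abstract pairs the estimate with a threshold for judging significance.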
