Identification of CpG islands in DNA sequences using statistically optimal null filters
Abstract
CpG dinucleotide clusters, also referred to as CpG islands (CGIs), are usually located in the promoter regions of genes in a deoxyribonucleic acid (DNA) sequence. Because CGIs play a crucial role in gene expression and cell differentiation, they are commonly used as gene markers. Earlier CGI identification methods used the high CpG dinucleotide content of CGIs as the characteristic measure for locating them. Some more recent methods instead exploit the fact that nucleotide G is more likely to follow nucleotide C inside a CGI than outside one: they use the difference in transition probabilities between successive nucleotides to distinguish a CGI from a non-CGI. These transition probabilities vary with the data being analyzed, and several sets of them have been reported in the literature, sometimes leading to contradictory results. In this article, we propose a new and efficient scheme for identifying CGIs using statistically optimal null filters. We formulate a new, unambiguous CGI identification characteristic that identifies CGIs in a given DNA sequence reliably and efficiently. The proposed scheme combines maximum signal-to-noise ratio and least-squares optimization criteria to estimate the CGI identification characteristic in the DNA sequence. Tested on a number of DNA sequences taken from human chromosomes 21 and 22, the scheme proved highly reliable and efficient in identifying CGIs.
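To make the content-based measure used by the earlier methods concrete, the following is a minimal Python sketch of a sliding-window scan based on GC fraction and the observed/expected CpG ratio. It is not the null-filter scheme proposed in this article. The function names `cpg_stats` and `candidate_windows`, and the default thresholds (200-base window, GC fraction of at least 0.5, observed/expected ratio of at least 0.6), are assumptions chosen for illustration from commonly quoted content-based criteria; they do not come from the abstract.

```python
def cpg_stats(window: str) -> tuple[float, float]:
    """Return (GC fraction, observed/expected CpG ratio) for one window."""
    w = window.upper()
    n = len(w)
    c, g = w.count("C"), w.count("G")
    # Occurrences of "CG" cannot overlap, so str.count gives the exact number.
    cpg = w.count("CG")
    gc_fraction = (c + g) / n if n else 0.0
    # Expected CpG count under independence is (c * g) / n.
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp


def candidate_windows(seq: str, size: int = 200, step: int = 1,
                      min_gc: float = 0.5, min_oe: float = 0.6) -> list[int]:
    """Start indices of windows that satisfy both (assumed) thresholds."""
    hits = []
    for start in range(0, len(seq) - size + 1, step):
        gc, oe = cpg_stats(seq[start:start + size])
        if gc >= min_gc and oe >= min_oe:
            hits.append(start)
    return hits
```

For example, `candidate_windows(sequence, size=200, step=50)` would return the start positions of 200-base windows worth inspecting further; overlapping hits would still need to be merged into contiguous islands, which this sketch leaves out.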