Textual data compression in computational biology: Algorithmic techniques

Computer Science Review - Volume 6 - Pages 1-25 - 2012
R. Giancarlo¹, D. Scaturro¹, F. Utro²
¹ University of Palermo, Dipartimento di Matematica ed Informatica, Via Archirafi 34, 90123 Palermo, Italy
² IBM T. J. Watson Research Center, Yorktown Heights, NY 10598, USA
