Textual data compression in computational biology: Algorithmic techniques
References
Giancarlo, 2009, Textual data compression in computational biology: a synopsis, Bioinformatics, 25, 1575, 10.1093/bioinformatics/btp117
Nalbantoglu, 2010, Data compression concepts and algorithms and their applications to bioinformatics, Entropy, 12, 34, 10.3390/e12010034
Cohen, 2004, Bioinformatics: an introduction for computer scientists, ACM Computing Surveys, 36, 122, 10.1145/1031120.1031122
Leleu, 2010, Processing and analyzing ChIP-seq data: from short reads to regulatory interactions, Briefings in Functional Genomics, 10.1093/bfgp/elq022
Pop, 2008, Bioinformatics challenges of new sequencing technology, Trends in Genetics, 24, 142, 10.1016/j.tig.2007.12.006
Flicek, 2009, The need for speed, Genome Biology, 10, 212, 10.1186/gb-2009-10-3-212
Navarro, 2007, Compressed full-text indexes, ACM Computing Surveys, 39, 2, 10.1145/1216370.1216372
Grumbach, 1993, Compression of DNA sequences, 340
Grumbach, 1994, A new challenge for compression algorithms: Genetic sequences, Information Processing & Management, 30, 875, 10.1016/0306-4573(94)90014-0
Cover, 1991
Rissanen, 1979, Arithmetic coding, IBM Journal of Research and Development, 23, 149, 10.1147/rd.232.0149
Rivals, 1996, A guaranteed compression scheme for repetitive DNA sequences, 453
Chen, 2000, A compression algorithm for DNA sequences and its applications in genome comparison, 107
Chen, 2002, DNACompress: fast and effective DNA sequence compression, Bioinformatics, 18, 1696, 10.1093/bioinformatics/18.12.1696
Apostolico, 1998, Some theory and practice of greedy off-line textual substitution, 119
Matsumoto, 2000, Biological sequence compression algorithms, Genome Informatics, 11, 43
Benci, 2004, Dynamical systems and computable information, Discrete and Continuous Dynamical Systems. Series B, 4, 935, 10.3934/dcdsb.2004.4.935
Manzini, 2005, A simple and fast DNA compressor, Software—Practice and Experience, 35, 1397
S. Bao, S. Chen, Z. Jing, R. Ren, A DNA sequence compression algorithm based on LUT and LZ77, CoRR abs/cs/0504100.
Behzadi, 2005, DNA compression challenge revisited: a dynamic programming approach, 190
Tabus, 2003, DNA sequence compression using the normalized maximum likelihood model for discrete regression, 253
Hategan, 2004, Protein is compressible, 192
Korodi, 2005, An efficient normalized maximum likelihood algorithm for DNA sequence compression, ACM Transactions on Information Systems, 23, 3, 10.1145/1055709.1055711
Tembe, 2010, G-SQZ: compact encoding of genomic sequence and quality data, Bioinformatics, 10.1093/bioinformatics/btq346
Cao, 2007, A simple statistical algorithm for biological sequence compression, 43
Adjeroh, 2006, On compressibility of protein sequences, 422
Adjeroh, 2002, DNA sequence compression using the Burrows–Wheeler transform, 303
Burrows, 1994, A block-sorting lossless data compression algorithm, Digital Equipment Corporation
Gusfield, 1997
Adjeroh, 2003, The SCP and compressed domain analysis of biological sequences, 587
Ziv, 1978, Compression of individual sequences via variable-rate coding, IEEE Transactions on Information Theory, 24, 530, 10.1109/TIT.1978.1055934
Kieffer, 2000, Grammar-based codes: a new class of universal lossless source codes, IEEE Transactions on Information Theory, 46, 737, 10.1109/18.841160
Larsson, 1999, Offline dictionary-based compression, 296
Nevill-Manning, 1997, Compression and explanation using hierarchical grammars, The Computer Journal, 40, 103, 10.1093/comjnl/40.2_and_3.103
Cameron, 1988, Source encoding using syntactic information source models, IEEE Transactions on Information Theory, 34, 843, 10.1109/18.9782
Cook, 1976, Grammatical inference by hill climbing, Information Sciences, 10, 59, 10.1016/0020-0255(76)90061-X
Marsh, 1982, Analysis and processing of compact text, 201
Stolcke, 1994, Inducing probabilistic grammars by Bayesian model merging, vol. 862, 106
N. Cherniavsky, R. Ladner, Grammar-based compression of DNA sequences, in: DIMACS Working Group on the Burrows–Wheeler Transform, 2004.
Liu, 2008, RNACompress: grammar-based compression and informational complexity measurement of RNA secondary structure, BMC Bioinformatics, 9, 176+, 10.1186/1471-2105-9-176
Higgs, 2000, RNA secondary structure: physical and computational aspects, Quarterly Reviews of Biophysics, 33, 199, 10.1017/S0033583500003620
Korodi, 2007, Compression of annotated nucleotide sequences, IEEE/ACM Transactions on Computational Biology and Bioinformatics, 4, 447, 10.1109/tcbb.2007.1017
Storer, 1988, Data Compression: Methods and Theory
Elias, 1975, Universal codeword sets and representations of the integers, IEEE Transactions on Information Theory, 21, 194, 10.1109/TIT.1975.1055349
Golomb, 1966, Run-length encodings, IEEE Transactions on Information Theory, 12, 399, 10.1109/TIT.1966.1053907
Huffman, 1952, vol. 40, 1098
Tichy, 1985, RCS: a system for version control, Software—Practice and Experience, 15, 637, 10.1002/spe.4380150703
Brandon, 2009, Data structures and compression algorithms for genomic sequence data, Bioinformatics, 10.1093/bioinformatics/btp319
Daily, 2010, Data structures and compression algorithms for high-throughput sequencing technologies, BMC Bioinformatics, 11, 514, 10.1186/1471-2105-11-514
Christley, 2009, Human genomes as email attachments, Bioinformatics, 25, 274, 10.1093/bioinformatics/btn582
Buchsbaum, 2000, Engineering the compression of massive tables: An experimental approach, 175
Buchsbaum, 2003, Improving table compression with combinatorial optimization, Journal of the ACM, 50, 825, 10.1145/950620.950622
Vo, 2004, Using column dependency to compress tables, 92
Vo, 2007, Compressing table data with column dependency, Theoretical Computer Science, 387, 273, 10.1016/j.tcs.2007.07.016
Apostolico, 2008, Table compression by record intersection, 13
1000 Genomes Project. Available at: http://www.1000genomes.org/, 2008.
White, 2008, Compressing DNA sequence databases with coil, BMC Bioinformatics, 9, 242, 10.1186/1471-2105-9-242
Schneider, 1986, Information content of binding sites on nucleotide sequences, Journal of Molecular Biology, 188, 415, 10.1016/0022-2836(86)90165-8
Gutell, 1992, Identifying constraints on the higher-order structure of RNA: continued development and application of comparative sequence analysis methods, Nucleic Acids Research, 20, 5785, 10.1093/nar/20.21.5785
Li, 1997
Bolshoy, 2003, DNA sequence analysis linguistic tools: contrast vocabularies, compositional spectra and linguistic complexity, Applied Bioinformatics, 2, 103
Konopka, 2005, Information theories in molecular biology and genomics, Nature Encyclopedia of the Human Genome, 3, 464
Lió, 1996, High statistics block entropy measures of DNA sequences, Journal of Theoretical Biology, 180, 151, 10.1006/jtbi.1996.0091
Benedetto, 2007, Compressing proteomes: the relevance of medium range correlations, EURASIP Journal on Bioinformatics and Systems Biology, 2007, 1, 10.1155/2007/60723
Loewenstern, 1999, Significantly lower entropy estimates for natural DNA sequences, Journal of Computational Biology, 6, 125, 10.1089/cmb.1999.6.125
Schmidt, 1997, Estimating the entropy of DNA sequences, Journal of Theoretical Biology, 188, 369, 10.1006/jtbi.1997.0493
Weiss, 1998, Correlations in protein sequences and property codes, Journal of Theoretical Biology, 190, 341, 10.1006/jtbi.1997.0560
Weiss, 2000, Information content of protein sequences, Journal of Theoretical Biology, 206, 379, 10.1006/jtbi.2000.2138
Farach, 1995, On the entropy of DNA: algorithms and measurements based on memory and rapid convergence, 48
Lempel, 1976, On the complexity of finite sequences, IEEE Transactions on Information Theory, 22, 75, 10.1109/TIT.1976.1055501
Rodeh, 1981, Linear algorithm for data compression via string matching, Journal of the ACM, 28, 16, 10.1145/322234.322237
Lanctot, 2000, Estimating DNA sequence entropy, 409
Gusfield, 2002, Suffix trees (and relatives) come of age in bioinformatics, 3
Apostolico, 1985, The myriad virtues of subword trees, 85
Ferragina, 2008, Compressed text indexes: from theory to practice!, ACM Journal of Experimental Algorithmics
Sadakane, 2003, New text indexing functionalities of compressed suffix arrays, Journal of Algorithms, 48, 294, 10.1016/S0196-6774(03)00087-7
Mäkinen, 2010, Storage and retrieval of highly repetitive sequence collections, Journal of Computational Biology, 17, 281, 10.1089/cmb.2009.0169
Sadakane, 2001, Indexing huge genome sequences for solving various problems, Genome Informatics, 12, 175
Healy, 2003, Annotating large genomes with exact word matches, Genome Research, 13, 2306, 10.1101/gr.1350803
Ferragina, 2000, Opportunistic data structures with applications, 390
Lippert, 2005, A space-efficient construction of the Burrows–Wheeler transform for genomic data, Journal of Computational Biology, 12, 943, 10.1089/cmb.2005.12.943
Lippert, 2005, Space-efficient whole genome comparisons with Burrows–Wheeler transforms, Journal of Computational Biology, 12, 407, 10.1089/cmb.2005.12.407
Välimäki, 2007, Compressed suffix tree—a basis for genome-scale sequence analysis, Bioinformatics, 23, 629, 10.1093/bioinformatics/btl681
Li, 2010, A survey of sequence alignment algorithms for next-generation sequencing, Briefings in Bioinformatics, 11, 473, 10.1093/bib/bbq015
Langmead, 2009, Ultrafast and memory-efficient alignment of short DNA sequences to the human genome, Genome Biology, 10, R25, 10.1186/gb-2009-10-3-r25
Li, 2009, Fast and accurate short read alignment with Burrows–Wheeler transform, Bioinformatics, 25, 1754, 10.1093/bioinformatics/btp324
Li, 2009, SOAP2: an improved ultrafast tool for short read alignment, Bioinformatics, 25, 1966, 10.1093/bioinformatics/btp336
Strelets, 1995, Compression of protein sequence databases, Computer Applications in Biosciences (CABIOS), 11, 557
1983
Waterman, 1995
Vinga, 2003, Alignment-free sequence comparison: A review, Bioinformatics, 19, 513, 10.1093/bioinformatics/btg005
Brudno, 2003, LAGAN and multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA, Genome Research, 13, 721, 10.1101/gr.926603
Li, 2003, The similarity metric, IEEE Transactions on Information Theory, 50, 3250, 10.1109/TIT.2004.838101
Varré, 1999, Transformation distances: a family of dissimilarity measures based on movements of segments, Bioinformatics, 15, 194, 10.1093/bioinformatics/15.3.194
Lemaitre, 2008, A small trip in the untranquil world of genomes: a survey on the detection and analysis of rearrangement breakpoints, Theoretical Computer Science, 395, 171, 10.1016/j.tcs.2008.01.014
Apostolico, 2010, Maximal words in sequence comparisons based on subword composition, vol. 6060, 34
Epifanio, 2010, Novel combinatorial and information theoretic alignment-free distances for biological data mining, 323
Kemena, 2009, Upcoming challenges for multiple sequence alignment methods in the high-throughput era, Bioinformatics, 25, 2455, 10.1093/bioinformatics/btp452
Kertesz-Farkas, 2009, The application of data compression-based distances to biological sequences, 83
Keogh, 2004, Towards parameter-free data mining, 206
Keogh, 2008, Compression-based data mining, 278
Otu, 2003, A new sequence distance measure for phylogenetic tree construction, Bioinformatics, 19, 2122, 10.1093/bioinformatics/btg295
Zhang, 2009, Normalized Lempel–Ziv complexity and its application in bio-sequence analysis, Journal of Mathematical Chemistry, 46, 1203, 10.1007/s10910-008-9512-2
Bennett, 1998, Information distance, IEEE Transactions on Information Theory, 44, 1407, 10.1109/18.681318
Cilibrasi, 2005, Clustering by compression, IEEE Transactions on Information Theory, 51, 1523, 10.1109/TIT.2005.844059
Ferragina, 2007, Compression-based classification of biological sequences and structures via the universal similarity metric: experimental assessment, BMC Bioinformatics, 8, 252, 10.1186/1471-2105-8-252
Shkarin, 2002, PPM: one step to practicality, 202
Apostolico, 2006, Mining, compressing and classifying with extensible motifs, Algorithms for Molecular Biology, 1, 4, 10.1186/1748-7188-1-4
Bastola, 2004, Utilization of the relative complexity measure to construct a phylogenetic tree for fungi, Mycological Research, 108, 117, 10.1017/S0953756203009079
Li, 2001, An information-based sequence distance and its application to whole mitochondrial genome phylogeny, Bioinformatics, 17, 149, 10.1093/bioinformatics/17.2.149
Rivals, 1996, Compression and genetic sequences analysis, Biochimie, 78, 315, 10.1016/0300-9084(96)84763-8
Rodrigo, 2008, The perils of plenty: what are we going to do with all these genes?, 3893
Weeks, 2007, Evolutionary hierarchies of conserved blocks in 5’-noncoding sequences of dicot rbcS genes, BMC Evolutionary Biology, 7, 51, 10.1186/1471-2148-7-51
Wheeler, 2000, Database resources of the national center for biotechnology information, Nucleic Acids Research, 28, 10, 10.1093/nar/28.1.10
Albayrak, 2010, Clustering of protein families into functional subtypes using relative complexity measure with reduced amino acid alphabets, BMC Bioinformatics, 11, 428, 10.1186/1471-2105-11-428
Gilbert, 2007, Alignment-free comparison of TOPS strings, 177
Kocsor, 2005, Application of compression-based distance measures to protein sequence classification: a methodological study, Bioinformatics, 22, 407, 10.1093/bioinformatics/bti806
Krasnogor, 2004, Measuring the similarity of protein structures by means of the universal similarity metric, Bioinformatics, 20, 1015, 10.1093/bioinformatics/bth031
Liu, 2006, Protein-based phylogenetic analysis by using hydropathy profile of amino acids, FEBS Letters, 580, 5321, 10.1016/j.febslet.2006.08.086
Liu, 2008, Comparison of TOPS strings based on LZ complexity, Journal of Theoretical Biology, 251, 159, 10.1016/j.jtbi.2007.11.016
Pelta, 2005, Protein structure comparison through fuzzy contact maps and the universal similarity metric, 1124
F. Rosselló, J. Rocha, J. Segura, Compression ratios based on the universal similarity metric still yield protein distances far from CATH distances. CoRR abs/q-bio/0603007.
Pearl, 2005, The CATH domain structure database and related resources Gene3D and DHS provide comprehensive domain family information for genome analysis, Nucleic Acids Research, 33, D247, 10.1093/nar/gki024
Barthel, 2008, ProCKSI: a decision support system for protein (structure) comparison, knowledge, similarity and information, BMC Bioinformatics, 8, 416, 10.1186/1471-2105-8-416
Otu, 2003, A divide-and-conquer approach to fragment assembly, Bioinformatics, 19, 22, 10.1093/bioinformatics/19.1.22
Galas, 2010, Biological information as set-based complexity, IEEE Transactions on Information Theory, 56, 667, 10.1109/TIT.2009.2037046
D.J. Galas, M. Nykter, G.W. Carter, N.D. Price, I. Shmulevich, Set-based complexity and biological information. CoRR abs/0801.4024.
Durbin, 1999
Smith, 1981, Identification of common molecular subsequences, Journal of Molecular Biology, 147, 195, 10.1016/0022-2836(81)90087-5
Altschul, 1990, Basic local alignment search tool, Journal of Molecular Biology, 215, 403, 10.1016/S0022-2836(05)80360-2
Viterbi, 1967, Error bounds for convolutional codes and an asymptotically optimum decoding algorithm, IEEE Transactions on Information Theory, 13, 260, 10.1109/TIT.1967.1054010
Buchsbaum, 1997, Algorithmic aspects in speech recognition: an introduction, ACM Journal of Experimental Algorithmics, 2, 1, 10.1145/264216.264219
Crochemore, 2003, A sub-quadratic sequence alignment algorithm for unrestricted cost matrices, SIAM Journal on Computing, 32, 1654, 10.1137/S0097539702402007
Giancarlo, 1997, Dynamic programming: special cases, 201
Mozes, 2007, Speeding up HMM decoding and training by exploiting sequence repetitions, 4
Gabriel, 2002, The structure of haplotype blocks in the human genome, Science, 296, 2225, 10.1126/science.1069424
Patil, 2001, Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21, Science, 294, 1719, 10.1126/science.1065573
Anderson, 2003, Finding haplotype block boundaries by using the Minimum-Description-Length principle, The American Journal of Human Genetics, 73, 336, 10.1086/377106
Bockhorst, 2007, Discovering patterns in biological sequences by optimal segmentation, 17
Daly, 2001, High-resolution haplotype structure in the human genome, Nature Genetics, 29, 229, 10.1038/ng1001-229
Koivisto, 2003, An MDL method for finding haplotype blocks and for estimating the strength of haplotype block boundaries, 502
Zhang, 2002, A dynamic programming algorithm for haplotype block partitioning, Proceedings of the National Academy of Sciences of the United States of America, 7335, 10.1073/pnas.102186799
Barron, 1998, The Minimum Description Length principle in coding and modeling, IEEE Transactions on Information Theory, 44, 2743, 10.1109/18.720554
Grünwald, 2005, Minimum Description Length tutorial, 23
Vitányi, 2000, Minimum Description Length induction, Bayesianism, and Kolmogorov complexity, IEEE Transactions on Information Theory, 46, 446, 10.1109/18.825807
Parida, 2007
Reinert, 2005, Statistics on words with applications to biological sequences, vol. 105, 252
Ferreira, 2007, Evaluating protein motif significance measures: a case study on prosite patterns, 34
Milosavljevic, 1993, Discovering simple DNA sequences by the algorithmic significance method, Computer Applications in the Biosciences, 9, 407
Milosavljevic, 1995, Discovering dependencies via algorithmic mutual information: A case study in DNA sequence comparisons, Machine Learning, 21, 35, 10.1007/BF00993378
Powell, 1998, Discovering simple DNA sequences by compression, 597
Aktulga, 2007, Identifying statistical dependence in genomic sequences via mutual information estimates, EURASIP Journal on Bioinformatics and Systems Biology, 2007, 1, 10.1155/2007/14741
Apostolico, 2003, Monotony of surprise and large-scale quest for unusual words, Journal of Computational Biology, 10, 283, 10.1089/10665270360688020
Brāzma, 1996, Discovering patterns and subfamilies in biosequences, 34
Q. Ma, J.T.L. Wang, Evaluating the significance of sequence motifs by the Minimum Description Length principle, 2000.
Nevill-Manning, 1997, Enumerating and ranking discrete motifs, 202
Chvátal, 1979, A greedy heuristic for the set-covering problem, Mathematics of Operations Research, 4, 233, 10.1287/moor.4.3.233
Jonassen, 1997, Efficient discovery of conserved patterns using a pattern graph, Computer Applications in the Biosciences, 13, 509
Apostolico, 2004, Motifs in Ziv–Lempel–Welch clef, 72
Apostolico, 2006, Bridging lossy and lossless compression by motif pattern discovery, General Theory of Information Transfer and Combinatorics, 4123, 793, 10.1007/11889342_51
Sharan, 2006, Modeling cellular machinery through biological network comparison, Nature Biotechnology, 24, 427, 10.1038/nbt1196
Zhang, 2008, Biomolecular network querying: a promising approach in systems biology, BMC Systems Biology, 2, 5, 10.1186/1752-0509-2-5
Margolin, 2006, ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context, BMC Bioinformatics, 7, S7, 10.1186/1471-2105-7-S1-S7
Butte, 1999, Unsupervised knowledge discovery in medical databases using relevance networks, 711
Butte, 2000, Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements, 415
Butte, 2000, Discovering functional relationships between RNA expression and chemotherapeutic susceptibility using relevance networks, Proceedings of the National Academy of Sciences of the United States of America, 12182, 10.1073/pnas.220392197
A. Margolin, N. Banerjee, I. Nemenman, A. Califano, Reverse engineering of the yeast transcriptional network using the ARACNE algorithm, Manuscript.
Meyer, 2007, Information-theoretic inference of large transcriptional regulatory networks, EURASIP Journal on Bioinformatics and Systems Biology, 2007, 8, 10.1155/2007/79879
Chor, 2007, Biological networks: Comparison, conservation, and evolution via relative description length, Journal of Computational Biology, 14, 817, 10.1089/cmb.2007.R018
Kanehisa, 2000, KEGG: Kyoto encyclopedia of genes and genomes, Nucleic Acids Research, 28, 27, 10.1093/nar/28.1.27
NCBI, NCBI taxonomy database. Available at: www.ncbi.nlm.nih.gov/entrez/linkout/tutorial/taxtour.html, 2007.
Sculley, 2006, Compression of DNA sequences
Hood, 2003, The digital code of DNA, Nature, 421, 444, 10.1038/nature01410
Ron, 1996, The power of amnesia: learning probabilistic automata with variable memory length, 117
Bejerano, 2001, Variations on probabilistic suffix trees: statistical modeling and prediction of protein families, Bioinformatics, 17, 23, 10.1093/bioinformatics/17.1.23
Apostolico, 2000, Optimal amnesic probabilistic automata or how to learn and classify proteins in linear time and space, 25
Schulz, 2008, Fast and adaptive variable order Markov chain construction, 306
Ziv, 2008, On finite memory universal data compression and classification of individual sequences, IEEE Transactions on Information Theory, 54, 1626, 10.1109/TIT.2008.917666
Handl, 2005, Computational cluster validation in post-genomic data analysis, Bioinformatics, 21, 3201, 10.1093/bioinformatics/bti517
M. Nykter, O. Yli-Harja, I. Shmulevich, Normalized compression distance for gene expression analysis, in: Proceedings of GENSIPS IEEE International Workshop on Genomic Signal Processing and Statistics, IEEE, 2005, pp. 2–3.
Zhou, 2004, Gene clustering based on clusterwide mutual information, Journal of Computational Biology, 11, 147, 10.1089/106652704773416939
Giancarlo, 2010, Distance functions, clustering algorithms and microarray data analysis, vol. 6073
R. Giancarlo, G. Lo Bosco, L. Pinello, F. Utro, The three steps of clustering in the post-genomic era: a synopsis, in: Proc. of CIBB, in: Lecture Notes in Computer Science, 2011, pp. 13–30.
Peng, 2005, Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy, IEEE Transactions on Pattern Analysis and Machine Intelligence, 27, 1226, 10.1109/TPAMI.2005.159
Zhou, 2007, Feature selection for microarray data analysis using mutual information and rough set theory, vol. 204, 916
Wang, 2002, An index structure for pattern similarity searching in DNA microarray data, 256
Salzberg, 1998
Szpankowski, 2003, An optimal DNA segmentation based on the MDL principle, 541
Ziv, 1988, On classification with empirically observed statistics and universal data compression, IEEE Transactions on Information Theory, 34, 278, 10.1109/18.2636
Shamir, 2000, Asymptotically optimal low-complexity sequential lossless coding for piecewise-stationary memoryless sources, IEEE Transactions on Information Theory, 46, 2244
Menconi, 2006, A compression-based approach for coding sequences identification in prokaryotic genomes, Journal of Computational Biology, 13, 1477, 10.1089/cmb.2006.13.1477
Menconi, 2004, Sublinear growth of information in DNA sequences, Bulletin of Mathematical Biology, 67, 737, 10.1016/j.bulm.2004.10.005
Madsen, 2008, Short tandem repeats in human exons: a target for disease mutations, BMC Genomics, 9, 410+, 10.1186/1471-2164-9-410
Rivals, 1997, Detection of significant patterns by compression algorithms: The case of approximate tandem repeats in DNA sequences, Computer Applications in the Biosciences, 13, 131
Allison, 1998, Compression of strings with approximate repeats, 8
Dix, 2007, Comparative analysis of long DNA sequences by per element information content using different contexts, BMC Bioinformatics, 8, S10, 10.1186/1471-2105-8-S2-S10
Modegi, 2004, Development of fast tandem repeat analysis and lossless compression method for DNA sequence, P088
Stern, 2001, Discovering patterns in Plasmodium falciparum genomic DNA, Molecular & Biochemical Parasitology, 118, 175, 10.1016/S0166-6851(01)00388-7
Allison, 1992, Sequence complexity for biological sequence analysis, Computers & Chemistry, 24, 43, 10.1016/S0097-8485(99)00046-7
Review, Nature reviews collection on microRNAs, Nature Review. doi:10.1038/nrg2202.
Evans, 2007, MicroRNA target detection and analysis for genes related to breast cancer using MDLcompress, EURASIP Journal on Bioinformatics and Systems Biology, 2007, 1, 10.1155/2007/43670
D. Loewenstern, H. Hirsh, P.N. Yianilos, M. Noordewier, DNA sequence classification using compression-based induction, Tech. Rep., DIMACS, 1995.
J. Abel, Data compression web site. http://www.data-compression.info/, 2002.
M. Mahoney, Data compression programs. http://www.cs.fit.edu/~mmahoney/compression/, 2008.
G. Manzini, M. Rastero, DNA corpus. http://www.mfn.unipmn.it/~manzini/dnacorpus/index.html, 2005.
Nevill-Manning, 1999, Protein is incompressible, 257
M.D. Cao, T.I. Dix, L. Allison, C. Mears, XM software. http://www.csse.monash.edu.au/~lloyd/tildeStrings/Compress/2007DCC/, 2007.
Bioinformatics Solutions, Bioinformatics solutions web site. http://www.bioinformaticssolutions.com/products/ph/, 2003.
G. Fowler, Pzip home page. http://www.research.att.com/~gsf/man/man1/pzip.html, 2003.
K.-P. Vo, Vcodex home page. http://www.research.att.com/~gsf/download/ref/vcodex/vcodex.html, 2002.
M.C. Brandon, D.C. Wallace, P. Baldi, ProjectDNACompression home page. http://www.mitomap.org/MITOWIKI/ProjectDNACompression, 2009.
S. Christley, Y. Lu, C. Li, X. Xie, DNAzip home page. http://www.ics.uci.edu/~xhx/project/DNAzip, 2010.
Li, 2009, The sequence alignment/map format and SAMtools (with the 1000 Genome Project Data Processing Subgroup), Bioinformatics, 25, 2078, 10.1093/bioinformatics/btp352
W. Tembe, J. Lowey, E. Suh, G-SQZ home page. http://public.tgen.org/sqz, 2010.
D. Loewenstern, P.N. Yianilos, CDNA home page. http://pnylab.com/pny/software/cdna/main.html, 1999.
R.A. Lippert, C.M. Mobarry, B.P. Walenz, Binary BWT. http://math.mit.edu/~lippert/software/bbbwt/, 2005.
N. Välimäki, W. Gerlach, K. Dixit, V. Mäkinen, SuDS genome browser. http://www.cs.helsinki.fi/group/suds/cst/, 2008.
P. Ferragina, R. González, G. Navarro, R. Venturini, Pizza & Chili home page. http://pizzachili.dcc.uchile.cl/, 2008.
P. Ferragina, R. Giancarlo, V. Greco, G. Manzini, G. Valiente, Kolmogorov library home page. http://www.math.unipa.it/~raffaele/kolmogorov/, 2007.
R. Cilibrasi, A.L. Cruz, S. de Rooij, M. Keijzer, CompLearn home page. http://www.complearn.org/, 2005.
Gusev, 1999, On the complexity measures of genetic sequences, Bioinformatics, 15, 994, 10.1093/bioinformatics/15.12.994
J.-S. Varré, J.P. Delahaye, É. Rivals, The transformation distance home page. http://www.lifl.fr/~varre/TD/td.html, 1999.
M. Crochemore, S.A. de Carvalho Jr., Sourceforge web site. http://neobio.sourceforge.net/, 2003.
Y. Lifshits, S. Mozes, O. Weimann, M. Ziv-Ukelson, Speeding up HMM decoding and training by exploiting sequence repetitions. http://www.cs.brown.edu/~shay/hmmspeedup/hmmspeedup.html, 2008.
A.A. Margolin, K. Wang, W.K. Lim, M. Kustagi, I. Nemenman, A. Califano, ARACNE home page. http://amdec-bioinfo.cu-genome.org/html/ARACNE.htm, 2006.