Estimating mutual information using B-spline functions – an improved similarity measure for analysing gene expression data
Abstract
The information-theoretic concept of mutual information provides a general framework for evaluating dependencies between variables. In the context of clustering genes with similar expression patterns, it has been suggested as a general similarity measure that extends commonly used linear measures. Since mutual information is defined in terms of discrete variables, applying it to continuous data requires binning procedures, which can lead to significant numerical errors for datasets of small or moderate size. In this work, we propose a method for the numerical estimation of mutual information from continuous data. We investigate the characteristic properties arising from the application of our algorithm and show that our approach outperforms commonly used algorithms: the significance, a measure of how well a result can be distinguished from random correlation, is substantially increased. The concept is then illustrated on two large-scale gene expression datasets, and the results are compared with those obtained using other similarity measures. C++ source code for our algorithm is available for non-commercial use from the authors upon request. Using mutual information as a similarity measure enables the detection of non-linear correlations in gene expression datasets. It thereby extends the frequently applied linear correlation measures, which are often used on an ad hoc basis without further justification.
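The abstract does not spell out the estimator, so the following is only a minimal sketch of the general idea suggested by the title: each data point is spread over several neighbouring bins with B-spline weights instead of being assigned to exactly one bin. The function names (`knot_vector`, `bspline_weights`, `mutual_information`), the clamped uniform knot layout, and the default parameters are assumptions made for illustration and are not taken from the authors' C++ code. The basis functions are evaluated with the standard Cox-de Boor recursion; with `order=1` the weights are one-hot and the scheme reduces to ordinary hard binning.

```python
import math
import random

def knot_vector(num_bins, order):
    # Clamped uniform knots (an assumed layout): `order` copies of 0,
    # the interior integers, then `order` copies of the upper end.
    top = num_bins - order + 1
    return [0] * order + list(range(1, top)) + [top] * order

def bspline_weights(z, num_bins, order):
    # Weights of one point z (already scaled to [0, num_bins - order + 1])
    # over all bins, via the Cox-de Boor recursion. The weights sum to 1.
    t = knot_vector(num_bins, order)
    top = t[-1]
    z = min(max(z, 0.0), float(top))
    b = [1.0 if t[i] <= z < t[i + 1] else 0.0 for i in range(len(t) - 1)]
    if z == top:
        b[num_bins - 1] = 1.0  # close the last non-empty interval on the right
    for k in range(2, order + 1):
        nxt = []
        for i in range(len(b) - 1):
            left_den = t[i + k - 1] - t[i]
            right_den = t[i + k] - t[i + 1]
            # Terms with a zero-width knot span are defined to be 0.
            left = (z - t[i]) / left_den * b[i] if left_den > 0 else 0.0
            right = (t[i + k] - z) / right_den * b[i + 1] if right_den > 0 else 0.0
            nxt.append(left + right)
        b = nxt
    return b

def entropy(probs):
    # Shannon entropy in bits; empty bins contribute nothing.
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(xs, ys, num_bins=6, order=3):
    # I(X;Y) = H(X) + H(Y) - H(X,Y), with soft bin counts from B-spline weights.
    n = len(xs)
    top = num_bins - order + 1

    def weights(vals):
        lo, hi = min(vals), max(vals)
        span = hi - lo
        scaled = [(v - lo) / span * top if span > 0 else 0.0 for v in vals]
        return [bspline_weights(z, num_bins, order) for z in scaled]

    wx, wy = weights(xs), weights(ys)
    px = [sum(w[i] for w in wx) / n for i in range(num_bins)]
    py = [sum(w[j] for w in wy) / n for j in range(num_bins)]
    pxy = [sum(a[i] * b[j] for a, b in zip(wx, wy)) / n
           for i in range(num_bins) for j in range(num_bins)]
    return entropy(px) + entropy(py) - entropy(pxy)

# Usage: a purely non-linear dependence (y = x^2) versus independent noise.
rng = random.Random(0)
x = [rng.uniform(-1.0, 1.0) for _ in range(300)]
y_dependent = [v * v for v in x]
y_independent = [rng.uniform(-1.0, 1.0) for _ in range(300)]
print("MI(x, x^2)   =", mutual_information(x, y_dependent))
print("MI(x, noise) =", mutual_information(x, y_independent))
```

In this sketch the quadratic relationship, which a linear correlation coefficient would largely miss for symmetric data, should yield a clearly higher estimate than the independent pair. The small positive value typically seen for independent data reflects the finite-sample bias of binned estimators that the abstract refers to.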