From correlation to causation networks: a simple approximate learning algorithm and its application to high-dimensional plant gene expression data

BMC Systems Biology - Volume 1 - Pages 1-10 - 2007
Rainer Opgen-Rhein (1), Korbinian Strimmer (2)
(1) Department of Statistics, Ludwig-Maximilians-Universität, München, Germany
(2) Institute for Medical Informatics, Statistics, and Epidemiology (IMISE), University of Leipzig, Leipzig, Germany

Abstract

Correlation networks are widely used in the analysis of gene expression and proteomics data, even though correlations are known to confound direct and indirect associations and offer no means of distinguishing cause from effect. "Causal" analysis typically requires inferring a directed graphical model, which is difficult in high dimensions because of the curse of dimensionality. We propose a simple heuristic for the statistical learning of a high-dimensional "causal" network. The method first converts a correlation network into a partial correlation graph. A partial ordering of the nodes is then established by multiple testing of the log-ratio of standardized partial variances. This makes it possible to identify a directed acyclic causal network as a subgraph of the partial correlation network. We illustrate the approach by analyzing a large Arabidopsis thaliana expression data set. The proposed approach is a heuristic algorithm based on a number of approximations, such as substituting full order partial correlations for lower order partial correlations. Nevertheless, for small samples and sparse networks the algorithm not only yields sensible first order approximations of the causal structure in high-dimensional genomic data but is also computationally highly efficient. The method is implemented in the "GeneNet" R package (version 1.2.0), available from CRAN and from http://strimmerlab.org/software/genets/ . The software includes an R script for reproducing the network analysis of the Arabidopsis thaliana data.
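
The two steps named in the abstract (correlation to partial correlation, then ordering nodes by the log-ratio of standardized partial variances) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the GeneNet implementation: the function names `invert` and `partially_directed_edges` are hypothetical, the 3x3 correlation matrix is made up, fixed cutoffs stand in for the multiple testing described in the abstract, and the convention of directing an edge from the node with the larger standardized partial variance to the node with the smaller one is an assumption that the abstract itself does not state. The sketch also takes a well-conditioned correlation matrix as given, whereas small-sample data would first need a regularized estimate.

```python
import math


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    # Augment each row with the matching row of the identity matrix.
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def partially_directed_edges(corr, pcor_cut=0.3, logratio_cut=0.1):
    """Return (source, target, kind, partial_correlation) tuples.

    kind is "->" for a directed edge and "--" for an undirected one.
    Both cutoffs are illustrative stand-ins for multiple testing.
    """
    omega = invert(corr)  # inverse of the correlation matrix
    n = len(corr)
    # Standardized partial variance: the share of a node's variance
    # that is not explained by all the other nodes.
    spv = [1.0 / omega[i][i] for i in range(n)]
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            # Full order partial correlation between nodes i and j.
            pcor = -omega[i][j] / math.sqrt(omega[i][i] * omega[j][j])
            if abs(pcor) < pcor_cut:
                continue  # no edge in the partial correlation graph
            log_ratio = math.log(spv[i] / spv[j])
            if log_ratio > logratio_cut:
                edges.append((i, j, "->", pcor))  # assumed: larger SPV points to smaller
            elif log_ratio < -logratio_cut:
                edges.append((j, i, "->", pcor))
            else:
                edges.append((i, j, "--", pcor))  # ordering undecided
    return edges


# Hypothetical example: nodes 0 and 1 are uncorrelated, both correlate with node 2.
corr = [[1.0, 0.0, 0.6],
        [0.0, 1.0, 0.6],
        [0.6, 0.6, 1.0]]
for source, target, kind, pcor in partially_directed_edges(corr):
    print(f"{source} {kind} {target}  (partial correlation {pcor:.3f})")
```

On this toy matrix the sketch directs the edges 0 -> 2 and 1 -> 2 and leaves an undirected edge between nodes 0 and 1. That last edge appears although the two nodes are marginally uncorrelated, because the full order partial correlation conditions on node 2. It is a small instance of the approximation the abstract mentions when it speaks of using full order in place of lower order partial correlations.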
