WGCNA: an R package for weighted correlation network analysis

BMC Bioinformatics - Volume 9 - Pages 1-13 - 2008
Peter Langfelder1, Steve Horvath2
1Department of Human Genetics, University of California, Los Angeles, USA
2Department of Human Genetics and Department of Biostatistics, University of California, Los Angeles, USA

Abstract

Correlation networks are increasingly being used in bioinformatics applications. For example, weighted gene co-expression network analysis is a systems biology method for describing the correlation patterns among genes across microarray samples. Weighted correlation network analysis (WGCNA) can be used for finding clusters (modules) of highly correlated genes, for summarizing such clusters using the module eigengene or an intramodular hub gene, for relating modules to one another and to external sample traits (using eigengene network methodology), and for calculating module membership measures. Correlation networks facilitate network-based gene screening methods that can be used to identify candidate biomarkers or therapeutic targets. These methods have been successfully applied in various biological contexts, e.g., cancer, mouse genetics, yeast genetics, and analysis of brain imaging data. While parts of the correlation network methodology have been described in separate publications, there is a need to provide a user-friendly, comprehensive, and consistent software implementation and an accompanying tutorial.

The WGCNA R software package is a comprehensive collection of R functions for performing various aspects of weighted correlation network analysis. The package includes functions for network construction, module detection, gene selection, calculations of topological properties, data simulation, visualization, and interfacing with external software. Along with the R package we also present R software tutorials. While the methods development was motivated by gene expression data, the underlying data mining approach can be applied to a variety of different settings.

The WGCNA package provides R functions for weighted correlation network analysis, e.g., co-expression network analysis of gene expression data. The R package, along with its source code and additional material, is freely available at http://www.genetics.ucla.edu/labs/horvath/CoexpressionNetwork/Rpackages/WGCNA.
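To make the idea of a weighted correlation network concrete, the following is a minimal sketch in plain Python, not the WGCNA package's own R interface. It assumes the commonly used unsigned soft-thresholding form, in which the connection strength between two genes is the absolute Pearson correlation raised to a power beta, and it computes each gene's connectivity (the sum of its connection strengths) as a simple way to spot a hub gene. The function names, the choice beta = 6, and the toy expression values are all illustrative assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def weighted_adjacency(profiles, beta=6):
    """Unsigned weighted adjacency: a_ij = |cor(x_i, x_j)| ** beta.

    The diagonal is left at 0 here so that connectivity excludes
    self-connections (an assumption made for this sketch).
    """
    n = len(profiles)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            a = abs(pearson(profiles[i], profiles[j])) ** beta
            adj[i][j] = adj[j][i] = a
    return adj


def connectivity(adj):
    """Connectivity of each gene: the row sums of the adjacency matrix."""
    return [sum(row) for row in adj]


# Toy data, made up for illustration: rows are genes, columns are samples.
profiles = [
    [1.0, 2.0, 3.0, 4.0, 5.0],  # gene A
    [2.1, 3.9, 6.2, 8.0, 9.8],  # gene B, tracks A closely
    [5.0, 4.1, 2.9, 2.2, 1.0],  # gene C, anti-correlated with A
    [3.0, 1.0, 4.0, 1.0, 5.0],  # gene D, only weakly related to the rest
]

adj = weighted_adjacency(profiles, beta=6)
k = connectivity(adj)
hub = k.index(max(k))  # index of the most connected gene
```

Raising the correlation to a power pushes weak correlations toward zero while keeping strong ones close to one, so genes A, B, and C end up tightly connected and gene D is nearly isolated. Because this sketch uses the absolute correlation, the anti-correlated gene C counts as strongly connected to A; a signed network would treat it differently.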
