GSVA: gene set variation analysis for microarray and RNA-Seq data
Abstract
Gene set enrichment (GSE) analysis is a popular framework for condensing information from gene expression profiles into a pathway or signature summary. The strengths of this approach over single-gene analysis include noise and dimension reduction, as well as greater biological interpretability. As molecular profiling experiments move beyond simple case-control studies, robust and flexible GSE methodologies are needed that can model pathway activity within highly heterogeneous data sets. To address this challenge, we introduce Gene Set Variation Analysis (GSVA), a GSE method that estimates variation of pathway activity over a sample population in an unsupervised manner. We demonstrate the robustness of GSVA in a comparison with current state-of-the-art sample-wise enrichment methods. Further, we provide examples of its utility in differential pathway activity and survival analysis. Lastly, we show how GSVA works analogously with data from both microarray and RNA-seq experiments. GSVA provides increased power to detect subtle pathway activity changes over a sample population in comparison to corresponding methods. While GSE methods are generally regarded as end points of a bioinformatic analysis, GSVA constitutes a starting point for building pathway-centric models of biology. Moreover, GSVA addresses the current need for GSE methods for RNA-seq data. GSVA is an open source software package for R which forms part of the Bioconductor project and can be downloaded at http://www.bioconductor.org.
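To make the idea of an unsupervised, sample-wise gene set score concrete, the following is a minimal sketch in Python. It is not the GSVA algorithm: it swaps in a much simpler stand-in that standardizes each gene across samples and averages the z-scores of the gene set members within each sample. The function name `sample_wise_scores`, the gene identifiers, and the toy expression values are all hypothetical and chosen only for illustration.

```python
from statistics import mean, pstdev


def sample_wise_scores(expr, gene_set):
    """Return one gene set score per sample, without using sample labels.

    expr     : dict mapping gene name -> list of expression values,
               one value per sample (hypothetical input layout).
    gene_set : iterable of gene names forming the pathway or signature.

    NOTE: simplified z-score averaging, used here as a stand-in;
    this is not how GSVA itself computes its enrichment statistic.
    """
    n_samples = len(next(iter(expr.values())))

    # Standardize each gene across the sample population.
    z = {}
    for gene, values in expr.items():
        mu = mean(values)
        sd = pstdev(values)
        # A gene with no variation carries no information: score it 0.
        z[gene] = [(v - mu) / sd if sd > 0 else 0.0 for v in values]

    # Keep only the gene set members present in the expression data.
    members = [g for g in gene_set if g in z]
    if not members:
        raise ValueError("no gene-set members found in the expression data")

    # Average the member z-scores within each sample.
    return [mean(z[g][j] for g in members) for j in range(n_samples)]


# Toy data (hypothetical): 4 genes measured in 4 samples.
expr = {
    "G1": [1.0, 2.0, 3.0, 4.0],
    "G2": [2.0, 4.0, 6.0, 8.0],
    "G3": [5.0, 5.0, 5.0, 5.0],
    "G4": [4.0, 3.0, 2.0, 1.0],
}
scores = sample_wise_scores(expr, ["G1", "G2"])
flat = sample_wise_scores(expr, ["G3"])
print(scores)
```

The result is a vector with one score per sample, so a gene-by-sample matrix becomes a gene-set-by-sample matrix when this is repeated over many gene sets. That output shape is what allows downstream analyses such as differential pathway activity or survival modelling to be run on pathways instead of individual genes.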