Bayesian Joint Analysis of Gene Expression Data and Gene Functional Annotations

Statistics in Biosciences - Volume 4 - Pages 300-318 - 2012
Xinlei Wang1, Min Chen2, Arkady B. Khodursky3, Guanghua Xiao2
1Department of Statistical Science, Southern Methodist University, Dallas, USA
2Division of Biostatistics, Department of Clinical Sciences, The University of Texas Southwestern Medical Center at Dallas, Dallas, USA
3Department of Biochemistry, Molecular Biology and Biophysics, The University of Minnesota, St. Paul, USA

Abstract

Identifying which genes, and which gene sets, are differentially expressed (DE) between two experimental conditions are two key questions in microarray analysis. Although closely related and seemingly similar, the two questions cannot replace each other, because each has its own importance and merits in scientific discovery. Existing approaches have been developed to address only one of the two. Further, most methods for detecting DE genes rely purely on gene expression analysis and do not use information about the functional grouping of genes. Methods for detecting altered gene sets often use a two-step procedure: the first step conducts differential expression analysis using expression data only, and the second step takes the results of the first and applies a testing procedure to examine whether DE genes are overrepresented in each predefined gene set. Such a sequential analysis may lose information, because the second step works only with summary results rather than the entire expression data. Here, we propose a Bayesian joint modeling approach that addresses the two key questions in parallel: it incorporates functional annotation information into the analysis of expression data and, at the same time, infers the enrichment of functional groups. Simulation results and an analysis of experimental data obtained for E. coli show that our integrated approach has improved statistical power, compared with conventional methods, in identifying both DE genes and altered gene sets.
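The conventional two-step procedure described above can be illustrated with a minimal sketch. It is not the joint model proposed in the paper, nor any specific comparator used there; it assumes one common variant in which step 1 calls a gene DE when the absolute Welch t statistic exceeds a cutoff, and step 2 applies a one-sided hypergeometric test to each gene set. The function names, the cutoff `t_cutoff`, and the toy data are all hypothetical.

```python
import math
import statistics


def welch_t(x, y):
    """Welch's t statistic for two independent samples (needs nonzero variance)."""
    vx, vy = statistics.variance(x), statistics.variance(y)
    return (statistics.mean(x) - statistics.mean(y)) / math.sqrt(vx / len(x) + vy / len(y))


def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) when a set of n genes is drawn from N genes, K of which are DE."""
    total = math.comb(N, n)
    hits = sum(math.comb(K, i) * math.comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return hits / total


def two_step(expr_a, expr_b, gene_sets, t_cutoff=4.0):
    # Step 1: gene-level DE calls from expression data only
    # (assumed rule: |t| above a fixed, illustrative cutoff).
    de = {g for g in expr_a if abs(welch_t(expr_a[g], expr_b[g])) > t_cutoff}
    # Step 2: per-set overrepresentation test that sees only the DE calls,
    # not the expression values themselves.
    N, K = len(expr_a), len(de)
    pvals = {}
    for name, genes in gene_sets.items():
        members = [g for g in genes if g in expr_a]
        k = sum(g in de for g in members)
        pvals[name] = hypergeom_upper_tail(k, N, K, len(members))
    return de, pvals


# Hypothetical toy data: six genes, three replicates per condition.
expr_a = {"g1": [5.0, 5.2, 4.9], "g2": [5.1, 4.8, 5.0], "g3": [1.0, 1.1, 0.9],
          "g4": [1.0, 1.2, 0.8], "g5": [2.0, 2.1, 1.9], "g6": [3.0, 3.2, 2.9]}
expr_b = {"g1": [1.0, 1.1, 0.9], "g2": [1.2, 0.9, 1.0], "g3": [1.05, 0.95, 1.1],
          "g4": [0.9, 1.1, 1.0], "g5": [2.05, 1.9, 2.0], "g6": [3.1, 2.9, 3.0]}
gene_sets = {"setA": ["g1", "g2", "g3"], "setB": ["g4", "g5", "g6"]}

de, pvals = two_step(expr_a, expr_b, gene_sets)
print(sorted(de), pvals)
```

In this sketch, step 2 receives only the binary DE calls, so the size of each gene's effect is discarded before the set-level test. This is the kind of information loss that the abstract attributes to sequential analysis.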
