A Shrinkage Approach to Large-Scale Covariance Matrix Estimation and Implications for Functional Genomics

Juliane Schäfer1, Korbinian Strimmer2
1Department of Statistics, University of Munich, Germany.
2Statistics

Abstract

Inferring large-scale covariance matrices from sparse genomic data is a ubiquitous problem in bioinformatics. Clearly, the widely used standard covariance and correlation estimators are ill-suited for this purpose. As a statistically efficient and computationally fast alternative, we propose a novel shrinkage covariance estimator that exploits the Ledoit-Wolf (2003) lemma for analytic calculation of the optimal shrinkage intensity. Subsequently, we apply this improved covariance estimator (which has guaranteed minimum mean squared error, is well-conditioned, and is always positive definite even for small sample sizes) to the problem of inferring large-scale gene association networks. We show that it performs very favorably compared to competing approaches, both in simulations and in application to real expression data.
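To make the idea of an analytically chosen shrinkage intensity concrete, the following is a minimal sketch in Python with NumPy. It is not taken from the abstract, and several choices are assumptions made for illustration: the shrinkage target is the diagonal matrix of sample variances, the intensity is estimated as the summed estimated variances of the off-diagonal sample covariances divided by their summed squares, and the intensity is clipped to [0, 1]. The function name `shrink_cov_diag_target` is hypothetical.

```python
import numpy as np


def shrink_cov_diag_target(X):
    """Shrink the sample covariance of X (n samples x p variables)
    toward its own diagonal (an assumed target, see lead-in).

    Returns the shrunken covariance matrix and the estimated intensity.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    Xc = X - X.mean(axis=0)  # center each variable

    # W[k, i, j] = (x_ki - mean_i) * (x_kj - mean_j).
    # This uses O(n * p^2) memory, which is fine for a sketch only.
    W = Xc[:, :, None] * Xc[:, None, :]
    W_bar = W.mean(axis=0)

    # Unbiased sample covariance matrix.
    S = n / (n - 1) * W_bar

    # Empirical estimate of the variance of each entry of S.
    var_S = n / (n - 1) ** 3 * ((W - W_bar) ** 2).sum(axis=0)

    # Only off-diagonal entries are shrunk, so only they enter the ratio.
    off = ~np.eye(p, dtype=bool)
    denom = (S[off] ** 2).sum()
    lam = var_S[off].sum() / denom if denom > 0 else 1.0
    lam = float(np.clip(lam, 0.0, 1.0))  # keep the intensity in [0, 1]

    T = np.diag(np.diag(S))  # assumed target: variances only
    S_star = lam * T + (1.0 - lam) * S
    return S_star, lam


# Usage on simulated "small n, large p" data (10 samples, 50 variables).
rng = np.random.default_rng(0)
X = rng.standard_normal((10, 50))
S_star, lam = shrink_cov_diag_target(X)
print("estimated shrinkage intensity:", lam)
print("smallest eigenvalue of shrunken estimate:", np.linalg.eigvalsh(S_star).min())
```

In this setting the unshrunk sample covariance has rank at most n - 1 and is therefore singular, whereas the shrunken estimate is a convex combination with a positive diagonal matrix and stays invertible whenever the intensity is greater than zero.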

References

Magwene, 2004, Estimating genomic coexpression networks using first-order conditional independence, Genome Biology, 5, 10.1186/gb-2004-5-12-r100

Tibshirani, 2002, Diagnosis of multiple cancer types by shrunken centroids of gene expression, Proc Natl Acad Sci USA, 99, 6567, 10.1073/pnas.082099299

Efron, 1977, Stein's paradox in statistics, Sci Am, 236, 119, 10.1038/scientificamerican0577-119

Greenland, 2000, Principles of multilevel modelling, Int J Epidemiol, 29, 158

Leung, 1998, Estimation of the scale matrix and its eigenvalues in the Wishart and the multivariate F distributions, Ann Inst Statist Math, 50, 523, 10.1023/A:1003529529228

Efron, 2004, Large-scale simultaneous hypothesis testing: the choice of a null hypothesis, J Amer Statist Assoc, 99, 96, 10.1198/016214504000000089

Hoerl, 1970, Ridge regression: applications to nonorthogonal problems, Technometrics, 12, 69, 10.1080/00401706.1970.10488635

Hoerl, 1970, Ridge regression: biased estimation for nonorthogonal problems, Technometrics, 12, 55, 10.1080/00401706.1970.10488634

Morris, 1983, Parametric empirical Bayes inference: theory and applications, J Amer Statist Assoc, 78, 47, 10.1080/01621459.1983.10477920

Ledoit, 2003, Improved estimation of the covariance matrix of stock returns with an application to portfolio selection, J Empir Finance, 10, 603

Efron, 1975, Data analysis using Stein's estimator and its generalizations, J Amer Statist Assoc, 70, 311, 10.1080/01621459.1975.10479864

Efron, 1975, Biased versus unbiased estimation, Adv Math, 16, 259

Butte, 2000, Discovering functional relationships between RNA expression and chemotherapeutic susceptibility using relevance networks, Proc Natl Acad Sci USA, 97, 12182, 10.1073/pnas.220392197

Wille, 2004, Sparse graphical Gaussian modeling of the isoprenoid gene network in Arabidopsis thaliana, Genome Biology, 5, 10.1186/gb-2004-5-11-r92

Eisen, 1998, Cluster analysis and display of genome-wide expression patterns, Proc Natl Acad Sci USA, 95, 14863, 10.1073/pnas.95.25.14863

Smyth, 2004, Linear models and empirical Bayes methods for assessing differential expression in microarray experiments, Statist Appl Genet Mol Biol, 3, 3

Cui, 2005, Improved statistical tests for differential gene expression by shrinking variance components estimates, Biostatistics, 6, 59, 10.1093/biostatistics/kxh018

Cox, 2004, A note on pseudolikelihood from marginal densities, Biometrika, 91, 729, 10.1093/biomet/91.3.729

Toh, 2002, Inference of a genetic network by a combined approach of cluster analysis and graphical Gaussian modeling, Bioinformatics, 18, 287, 10.1093/bioinformatics/18.2.287