A comparison of imputation procedures and statistical tests for the analysis of two-dimensional electrophoresis data
Abstract
Numerous gel-based software packages exist to detect protein changes potentially associated with disease. The data, however, are rife with technical and structural complexities, which makes statistical analysis difficult. A particularly important question is how the various packages handle missing data. To date, no one has extensively studied the impact that imputing missing data has on the subsequent analysis of protein spots. This work highlights the existing algorithms for handling missing data in two-dimensional gel analysis and performs a thorough comparison of the various algorithms and statistical tests on simulated and real datasets. Among the imputation methods, the lowest root mean squared error is obtained with least squares imputation combined with the expectation maximization (EM) algorithm, which estimates missing values using an array covariance structure. The bootstrapped versions of the statistical tests offer the most liberal option for determining protein spot significance, while the generalized family-wise error rate (gFWER) should be considered for controlling the multiple testing error. In summary, we advocate a three-step statistical analysis of two-dimensional gel electrophoresis (2-DE) data: a data imputation step, a choice of statistical test, and an error control method to account for multiple testing. When choosing the statistical test, it is worth considering whether the protein spots will be subjected to mass spectrometry. If so, a more liberal test such as the percentile-based bootstrap t can be employed. For error control in electrophoresis experiments, we advocate controlling the gFWER rather than the false discovery rate.
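The three steps named above (imputation, test, error control) can be illustrated with a minimal sketch. It is not the implementation used in this work, and all function names are hypothetical. Row-mean imputation stands in for the least squares and EM methods, which are more involved. The bootstrap test is a generic two-sample bootstrap of a Welch-type t statistic under a centered null. The error control step uses the single-step generalized Bonferroni rule (reject when p <= k*alpha/m), which is one simple way to bound the probability of k or more false rejections and is not necessarily the gFWER procedure applied here.

```python
import math
import random


def row_mean_impute(matrix):
    """Fill None entries with the mean of the observed values in the same row (spot).

    Stand-in for least squares / EM imputation; assumes every row has at
    least one observed value.
    """
    filled = []
    for row in matrix:
        observed = [v for v in row if v is not None]
        mean = sum(observed) / len(observed)
        filled.append([mean if v is None else v for v in row])
    return filled


def rmse(truth, imputed, mask):
    """Root mean squared error over the cells listed in mask as (row, col) pairs."""
    squared = [(truth[i][j] - imputed[i][j]) ** 2 for i, j in mask]
    return math.sqrt(sum(squared) / len(squared))


def welch_t(x, y):
    """Two-sample t statistic with unpooled variances (needs >= 2 values per group)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    vx = sum((v - mx) ** 2 for v in x) / (len(x) - 1)
    vy = sum((v - my) ** 2 for v in y) / (len(y) - 1)
    se = math.sqrt(vx / len(x) + vy / len(y))
    if se == 0.0:
        # Degenerate resample: no spread in either group.
        return 0.0 if mx == my else math.copysign(math.inf, mx - my)
    return (mx - my) / se


def bootstrap_t_pvalue(x, y, n_boot=2000, seed=0):
    """Two-sided bootstrap p-value for a difference in group means.

    Each group is centered on its own mean so that the resampled data obey
    the null hypothesis of equal means.
    """
    rng = random.Random(seed)
    t_obs = abs(welch_t(x, y))
    mx, my = sum(x) / len(x), sum(y) / len(y)
    x0 = [v - mx for v in x]
    y0 = [v - my for v in y]
    exceed = 0
    for _ in range(n_boot):
        xb = [rng.choice(x0) for _ in x0]
        yb = [rng.choice(y0) for _ in y0]
        if abs(welch_t(xb, yb)) >= t_obs:
            exceed += 1
    # Add-one correction keeps the estimate strictly positive.
    return (exceed + 1) / (n_boot + 1)


def generalized_bonferroni(pvalues, k=2, alpha=0.05):
    """Reject spot i when p_i <= k * alpha / m (bounds P(k or more false rejections))."""
    cutoff = k * alpha / len(pvalues)
    return [p <= cutoff for p in pvalues]
```

In a simulation of the kind described, one would delete known values from a complete spot-by-gel matrix, impute them, and score the result with `rmse` over the deleted cells. After that, `bootstrap_t_pvalue` would be applied per spot and the resulting p-values passed to `generalized_bonferroni`.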