The influence of scaling metabolomics data on model classification accuracy

Metabolomics - Volume 11 - Pages 684-695 - 2014

Piotr S. Gromski¹, Yun Xu¹, Katherine A. Hollywood², Michael L. Turner³, Royston Goodacre¹

¹School of Chemistry, Manchester Institute of Biotechnology, The University of Manchester, Manchester, UK

²Faculty of Life Science, Manchester Institute of Biotechnology, The University of Manchester, Manchester, UK

³School of Chemistry, The University of Manchester, Manchester, UK

Abstract

Correctly measured classification accuracy is important not only for properly classifying pre-designated classes, such as disease versus control, but also for ensuring that the biological question can be answered competently. We recognised that there has been minimal investigation of pre-treatment methods and their influence on classification accuracy within the metabolomics literature. The standard approach to pre-treatment prior to classification modelling often uses methods such as autoscaling, which places all variables on a comparable scale and thus allows one to achieve separation of two or more groups (target classes). This is often undertaken without any prior investigation into the influence of the pre-treatment method on the data and on the supervised learning techniques employed. Whilst this is useful in many cases for deriving essential information such as predictive ability or visual interpretation, as shown in this study the standard approach is not always the most suitable option available. Here, we investigate the influence of six pre-treatment methods (autoscaling, range scaling, level scaling, Pareto scaling and vast scaling, as well as no scaling) on four classification models, namely principal components-discriminant function analysis (PC-DFA), support vector machines (SVM), random forests (RF) and k-nearest neighbours (kNN), using three publicly available metabolomics data sets. We demonstrate that different pre-treatment methods can greatly affect the interpretation of the statistical modelling outputs. The results show that data pre-treatment is context dependent and that no single method was superior for all the data sets used. Whilst we did find that vast scaling produced the most robust models in terms of classification rate for PC-DFA of both NMR spectroscopy data sets, in general we conclude that both vast scaling and autoscaling produced similar and superior results in comparison to the other four pre-treatment methods on both NMR and GC–MS data sets. It is therefore our recommendation that vast scaling be the primary pre-treatment method, as it appears to be more stable and robust across all the classifiers used in this study.
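The scaling methods named in the abstract can be illustrated with a minimal sketch. It assumes the commonly cited per-variable definitions from van den Berg et al. (2006), in which each variable is mean-centred and then divided by a method-specific factor; the abstract itself does not give the formulas, so these may differ in detail from what the authors used. The function names `scale_column` and `scale_matrix` are hypothetical and are not taken from the paper.

```python
from math import sqrt
from statistics import mean, stdev


def scale_column(values, method="auto"):
    """Scale one variable (e.g. one metabolite measured across all samples).

    Assumed definitions (after van den Berg et al., 2006), with m = mean
    and s = sample standard deviation of the variable:
      auto   : (x - m) / s
      range  : (x - m) / (max - min)
      level  : (x - m) / m
      pareto : (x - m) / sqrt(s)
      vast   : ((x - m) / s) * (m / s)
      none   : x unchanged
    """
    if method == "none":
        return list(values)
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 denominator)
    divisors = {
        "auto": s,
        "range": max(values) - min(values),
        "level": m,
        "pareto": sqrt(s),
        "vast": s * s / m,  # same as multiplying the autoscaled value by m / s
    }
    d = divisors[method]  # raises KeyError for an unknown method name
    return [(x - m) / d for x in values]


def scale_matrix(rows, method="auto"):
    """Apply scale_column to every column of a samples-by-variables table."""
    columns = [scale_column(col, method) for col in zip(*rows)]
    return [list(row) for row in zip(*columns)]


if __name__ == "__main__":
    # Toy data: three samples, two variables on very different scales.
    data = [[2.0, 100.0], [4.0, 300.0], [6.0, 500.0]]
    for name in ("none", "auto", "range", "level", "pareto", "vast"):
        print(name, scale_matrix(data, name))
```

The sketch estimates the mean and spread from all samples at once. When classifiers are validated on held-out samples, these quantities would normally be estimated on the training samples only and then reused for the test samples; that step is omitted here for brevity.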

