Centering, scaling, and transformations: improving the biological information content of metabolomics data
Abstract
Extracting relevant biological information from large data sets is a major challenge in functional genomics research. Different aspects of the data hamper their biological interpretation. For instance, 5000-fold differences in concentration between metabolites are present in a metabolomics data set, yet these differences are not proportional to the biological relevance of the metabolites. Data analysis methods, however, are not able to make this distinction. Data pretreatment methods can correct for aspects that hinder the biological interpretation of metabolomics data sets by emphasizing the biological information in the data set, thereby improving its biological interpretability. Different data pretreatment methods, i.e. centering, autoscaling, pareto scaling, range scaling, vast scaling, log transformation, and power transformation, were tested on a real-life metabolomics data set. They were found to greatly affect the outcome of the data analysis and thus the rank of the metabolites that are most important from a biological point of view. Furthermore, the stability of the rank, the influence of technical errors on data analysis, and the preference of data analysis methods for selecting highly abundant metabolites were affected by the data pretreatment method used prior to data analysis. Different pretreatment methods emphasize different aspects of the data, and each pretreatment method has its own merits and drawbacks. The choice of a pretreatment method depends on the biological question to be answered, the properties of the data set, and the data analysis method selected. For the explorative analysis of the validation data set used in this study, autoscaling and range scaling performed better than the other pretreatment methods.
That is, range scaling and autoscaling removed the dependence of the rank of the metabolites on the average concentration and the magnitude of the fold changes, and gave biologically sensible results after principal component analysis (PCA). In conclusion, selecting a proper data pretreatment method is an essential step in the analysis of metabolomics data and greatly affects which metabolites are identified as the most important.
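The abstract names the pretreatment methods without giving formulas. The following is a minimal sketch of how they are commonly defined, followed by a small PCA comparison. Several things here are assumptions and not taken from the text above: the data layout (rows are samples, columns are metabolites), the use of the sample standard deviation, base-10 logarithm and square root for the two transformations, the helper names, the synthetic toy values, and the choice to rank metabolites by absolute loading on the first principal component.

```python
import numpy as np

# Assumed layout: rows = samples, columns = metabolites.
def center(X):
    return X - X.mean(axis=0)

def autoscale(X):
    # Divide by the sample standard deviation: unit variance per metabolite.
    return center(X) / X.std(axis=0, ddof=1)

def pareto_scale(X):
    # Divide by the square root of the standard deviation.
    return center(X) / np.sqrt(X.std(axis=0, ddof=1))

def range_scale(X):
    # Divide by the observed range (max - min) of each metabolite.
    return center(X) / (X.max(axis=0) - X.min(axis=0))

def vast_scale(X):
    # Autoscaling multiplied by mean/std (inverse coefficient of variation).
    s = X.std(axis=0, ddof=1)
    return center(X) / s * (X.mean(axis=0) / s)

def log_transform(X):
    # Requires strictly positive values; zeros need separate handling.
    return center(np.log10(X))

def power_transform(X):
    # Square-root transformation, then centering.
    return center(np.sqrt(X))

def pc1_loadings(Xp):
    # Rows of Vt are the principal axes of the pretreated matrix.
    _, _, Vt = np.linalg.svd(Xp, full_matrices=False)
    return Vt[0]

def rank_by_pc1(Xp):
    # Metabolite indices ordered by decreasing absolute PC1 loading.
    return np.argsort(-np.abs(pc1_loadings(Xp)))

# Synthetic toy data (not from the study): 8 samples in two groups of 4.
g = np.array([-0.5] * 4 + [0.5] * 4)              # group effect
p1 = np.array([1, -1, 1, -1, 1, -1, 1, -1], dtype=float)
p2 = np.array([1, 1, -1, -1, 1, 1, -1, -1], dtype=float)
p3 = p1 * p2
X = np.column_stack([
    5000 + 500 * p1,          # metabolite 0: abundant, unrelated to group
    1.5 + 1.0 * g + 0.1 * p2, # metabolite 1: scarce, two-fold group effect
    60 + 20 * g + 2 * p3,     # metabolite 2: moderate, clear group effect
])

methods = {
    "centering": center, "autoscaling": autoscale, "pareto": pareto_scale,
    "range": range_scale, "vast": vast_scale,
    "log": log_transform, "power": power_transform,
}
ranks = {name: rank_by_pc1(f(X)) for name, f in methods.items()}
for name, r in ranks.items():
    print(f"{name:12s} metabolites ranked by |PC1 loading|: {r}")
```

With these toy numbers, centering alone puts the abundant but group-unrelated metabolite first, whereas autoscaling and range scaling put it last and give the two group-related metabolites equal weight. This mirrors, on a constructed example only, the dependence of the rank on average concentration that the abstract describes.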