Statistical measures for validating plant genotype similarity assessments following multivariate analysis of metabolome fingerprint data

Metabolomics - Tập 3 - Trang 349-355 - 2007
David P. Enot1, John Draper1
1Institute of Biological Sciences, University of Wales, Aberystwyth, Aberystwyth, UK

Tóm tắt

Metabolome fingerprinting offers opportunities for ‘first pass’ evaluation of compositional similarity between plant genotypes. Compositional “substantial equivalence” testing is a popular concept in the literature in relation to food safety; however reported studies do not provide a systematic and standard approach to quantify similarity in a high dimensional data context. We have undertaken a large scale screen of Arabidopsis genotypes for evidence that individual genetic modifications effect plant phenotype at the level of the metabolome. From this study we propose pragmatic alternative measures that could in the future be used to assess substantial equivalence in GM foods under realistic data paucity constraints and without prior feature selection. Evaluation of classifier accuracy in supervised data mining approaches by bootstrap error estimation provided a robust tool for model validation. Receiver operating characteristics (such as AUC) provide an alternative measure of predictive ability by displaying the relationship between sensitivity and specificity. Additional specific measures based on scatter matrices and sample margins have also been investigated. We illustrate the application of such metrics on a large metabolic profiling data set derived from analysis of 27 genetically distinct Arabidopsis thaliana mutants. We show that agreement exists between model margins, eigenvalue, accuracies and AUC characteristics produced by three different classifiers (Random Forest, Support Vector Machine and Linear Discriminant Analysis). Comparisons between mutants with no observable phenotypic differences to the parent ecotype provided a baseline for model significance metrics; whilst comparison of mutants with increasingly distinct phenotypic alterations generated predictable changes in these measures of similarity.

Tài liệu tham khảo

Baker, J. M., Hawkins, N. D., Ward, J. L., Lovegrove, A., Napier, J. A., Shewry, P. R., & Beale, M. H. (2006). A metabolomic study of substantial equivalence of field-grown genetically modified wheat. Plant Biotechnology Journal, 4, 381–392. Berrar, D., Bradbury, I., & Dubitzky, W. (2006). Avoiding model selection bias in small-sample genomic datasets. Bioinformatics, 22, 1245–1250. Bickel, D. R. (2004). Degrees of differential gene expression: detecting biologically significant expression differences and estimating their magnitudes. Bioinformatics, 20, 682–688. Braga-Neto, U. M., & Dougherty, E. R. (2004). Is cross-validation valid for small-sample microarray classification? Bioinformatics, 20, 374–380. Braga-Neto, U., & Dougherty, E. R. (2005). Exact performance of error estimators for discrete classifiers. Pattern Recognition 38, 1799–1814. Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32. Broadhurst, D. I., & Kell, D. B. (2006). Statistical strategies for avoiding false discoveries in metabolomics and related experiments. Metabolomics, 2, 171–196. Catchpole, G. S., Beckmann, M., Enot, D. P., Mondhe, M., Zywicki, B., Taylor, J., Hardy, N., Smith, A., King, R. D., Kell, D. B., Fiehn, O., & Draper, J. (2005). Hierarchical metabolomics demonstrates substantial compositional similarity between genetically modified and conventional potato crops. Proceedings of the National Academy of Sciences of the United States of America, 102, 14458–14462. Charlton, A., Allnutt, T., Holmes, S., Chisholm, J., Bean, S., Ellis, N., Mullineaux, P., & Oehlschlager, S. (2004). NMR profiling of transgenic peas. Plant Biotechnology Journal, 2, 27–35. Choi, H. K., Choi, Y. H., Verberne, M., Lefeber, A. W., Erkelens, C., & Verpoorte, R. (2004). Metabolic fingerprinting of wild type and transgenic tobacco plants by 1H NMR and multivariate analysis technique. Phytochemistry, 65, 857–864. Cockburn, A. (2002). Assuring the safety of genetically modified (GM) foods: the importance of an holistic, integrative approach. Journal of Biotechnology, 98, 79–106. Dıaz-Uriarte, R. (2005). Supervised methods with genomic data: A review and cautionary view. Data analysis and visualization in genomics and proteomics (pp. 193–214). New York: Wiley. Dietterich, T. G. (2000). Ensemble methods in machine learning. Lecture Notes in Computer Science, 1857, 1–15. Efron, B. (1983). Estimating the error rate of a prediction rule: Improvement on cross-validation. Journal of the American Statistical Association, 78, 316–331. Efron, B., & Tibshirani, R. (1997). Improvements on cross-validation: the .632+ bootstrap method. Journal of the American Statistical Association, 92, 548–560. Enot, D. P., Beckmann, M., Overy, D., & Draper, J. (2006). Predicting interpretability of metabolome models based on behavior, putative identity, and biological relevance of explanatory signals. Proceedings of the National Academy of Sciences of the United States of America, 103, 14865–14870. Fawcett, T. (2003). ROC Graphs: Notes and practical considerations for data mining researchers. HP Laboratories technical report. Fu, W. J., Carroll, R. J., & Wang, S. (2005). Estimating misclassification error with small samples via bootstrap cross-validation. Bioinformatics, 21, 1979–1986. Fukusaki, E., & Kobayashi, A. (2005). Plant metabolomics: Potential for practical operation. Journal of Bioscience Bioengineering, 100, 347–354. Garratt, L. C., Linforth, R., Taylor, A. J., Lowe, K. C., Power, J. B., & Davey, M. R. (2005). Metabolite fingerprinting in transgenic lettuce. Plant Biotechnology Journal, 3, 165–174. Good, P. (2000). Permutation tests: A practical guide to resampling methods for testing hypotheses. Springer series in statistics. Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning: Data mining, inference, and prediction. Springer. Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: A review. ACM Computing Surveys (CSUR), 31, 264–323. Konig, A., Cockburn, A., Crevel, R. W., Debruyne, E., Grafstroem, R., Hammerling, U., Kimber, I., Knudsen, I., Kuiper, H. A., Peijnenburg, A. A., Penninks, A. H., Poulsen, M., Schauzu, M., & Wal, J. M. (2004). Assessment of the safety of foods derived from genetically modified (GM) crops. Food and Chemical Toxicology, 42, 1047–1088. Kuiper, H. A., Kleter, G. A., Noteborn, H. P., & Kok, E. J. (2001). Assessment of the food safety issues related to genetically modified foods. The Plant Journal, 27, 503–528. Kuiper, H. A., Kleter, G. A., Noteborn, H. P., & Kok, E. J. (2002). Substantial equivalence–an appropriate paradigm for the safety assessment of genetically modified foods? Toxicology, 181–182, 427–431. Kuiper, H. A., Kok, E. J., & Engel, K. H. (2003). Exploitation of molecular profiling techniques for GM food safety assessment. Current Opinion in Biotechnology, 14, 238–243. Le Gall, G., Colquhoun, I. J., Davis, A. L., Collins, G. J., & Verhoeyen, M. E. (2003). Metabolite profiling of tomato (Lycopersicon esculentum) using 1H NMR spectroscopy as a tool to detect potential unintended effects following a genetic modification. Journal of Agricultural and Food Chemistry, 51, 2447–2456. Liaw, A., Wiener, M. (2002). Classification and regression by randomForest. R News, 2, 18–22. Lyons-Weiler, J., Pelikan, R., Zeh Iii H. J., Whitcomb, D. C., Malehorn, D. E., Bigbee, W. L., & Hauskrecht, M. (2005). Assessing the statistical significance of the achieved classification error of classifiers constructed using serum peptide profiles, and a prescription for random sampling repeated studies for massive high-throughput genomic and proteomic studies. Cancer Informatics, 1, 53–77. Manetti, C., Bianchetti, C., Bizzarri, M., Casciani, L., Castro, C., D’Ascenzo, G., Delfini, M., Di Cocco, M. E., Lagana, A., Miccheli, A., Motto, M., & Conti, F. (2004). NMR-based metabonomic study of transgenic maize. Phytochemistry, 65, 3187–3198. Manetti, C., Bianchetti, C., Casciani, L., Castro, C., Di Cocco, M. E., Miccheli, A., Motto, M., & Conti, F. (2006). A metabonomic study of transgenic maize (Zea mays) seeds revealed variations in osmolytes and branched amino acids. Journal of Experimental Botany, 57, 2613–2625. Manly, B. F. J. (2004). Multivariate statistical methods: A primer. Chapman & Hall/CRC. Martinez, A. M., & Kak, A. C. (2001). PCA versus LDA. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23, 228–233. Massart, D. L. (1988). Chemometrics. Amsterdam: Elsevier. Mattoo, A. K., Sobolev, A. P., Neelam, A., Goyal, R. K., Handa, A. K., & Segre, A. L. (2006). Nuclear magnetic resonance spectroscopy-based metabolite profiling of transgenic tomato fruit engineered to accumulate spermidine and spermine reveals enhanced anabolic and nitrogen–carbon interactions. Plant Physiology, 142, 1759–1770. Shepherd, L. V., McNicol, J. W., Razzo, R., Taylor, M. A., & Davies, H. V. (2006). Assessing the potential for unintended effects in genetically modified potatoes perturbed in metabolic and developmental processes. Targeted analysis of key nutrients and anti-nutrients. Transgenic Research, 15, 409–425. Sing, T., Sander, O., Beerenwinkel, N., & Lengauer, T. (2005). ROCR: Visualizing classifier performance in R. Bioinformatics, 21, 3940–3941. Singh, S. (2003). Multiresolution estimates of classification complexity. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25, 1534–1539. Somorjai, R. L., Dolenko, B., Baumgartner, R. (2003). Class prediction and discovery using gene microarray and proteomics mass spectroscopy data: Curses, caveats, cautions. Bioinformatics, 19, 1484–1491. Tan, C. S., Ploner, A., Quandt, A., Lehtio, J., & Pawitan, Y. (2006). Finding regions of significance in SELDI measurements for identifying protein biomarkers. Bioinformatics, 22, 1515–1523. Thomaz, C. E., Boardman, J. P., Hill, D. L. G., Hajnal, J. V., Edwards, D. D., Rutherford, M. A., Gillies, D. F., & Rueckert, D. (2004). Using a Maximum Uncertainty LDA-Based Approach to Classify and Analyse MR Brain Images. Lecture Notes In Computer Science, 3216, 291–300. Windeatt, T. (2003). Vote counting measures for ensemble classifiers. Pattern Recognition, 36, 2743–2756. Yang, J., & Yang, J. (2003). Why can LDA be performed in PCA transformed space? Pattern Recognition, 36, 563–566.