Collinearity: a review of methods to deal with it and a simulation study evaluating their performance

Ecography - Vol. 36, No. 1, pp. 27-46 - 2013
Carsten F. Dormann1,2, Jane Elith3, Sven Bacher4, Carsten M. Buchmann5, Gudrun Carl1, Gabriel Carré6, Jaime Márquez7, Bernd Gruber1,8, Bruno Lafourcade9, Pedro J. Leitão10,11, Tamara Münkemüller9, Colin J. McClean12, Patrick E. Osborne13, Björn Reineking14, Boris Schröder15,5, Andrew K. Skidmore16, Damaris Zurell15,5, Sven Lautenbach1,17
1Helmholtz Centre for Environmental Research
2University of Freiburg
3School of Botany
4University of Fribourg
5University of Potsdam
6Services déconcentrés d'appui à la recherche Provence-Alpes-Côte d'Azur
7Senckenberg Museum, Frankfurt
8University of Canberra
9Université Joseph Fourier - Grenoble 1
10Humboldt-Universität zu Berlin
11Universidade Técnica de Lisboa
12University of York
13University of Southampton
14University of Bayreuth
15Technical University of Munich
16University of Twente
17Rheinische Friedrich-Wilhelms-Universität Bonn

Abstract

Collinearity refers to the non-independence of predictor variables, usually in a regression-type analysis. It is a common feature of any descriptive ecological data set and can be a problem for parameter estimation because it inflates the variance of regression parameters and hence potentially leads to the wrong identification of relevant predictors in a statistical model. Collinearity is a severe problem when a model is trained on data from one region or time and used to predict to another with a different or unknown structure of collinearity. To demonstrate the reach of the problem of collinearity in ecology, we show how relationships among predictors differ between biomes, change over spatial scales and through time. Across disciplines, different approaches to addressing collinearity problems have been developed, ranging from clustering of predictors and threshold-based pre-selection, through latent variable methods, to shrinkage and regularisation. Using simulated data with five predictor-response relationships of increasing complexity and eight levels of collinearity, we compared ways to address collinearity with standard multiple regression and machine-learning approaches. We assessed the performance of each approach by testing its impact on prediction to new data. In the extreme, we tested whether the methods were able to identify the true underlying relationship in a training dataset with strong collinearity by evaluating their performance on a test dataset without any collinearity. We found that methods specifically designed for collinearity, such as latent variable methods and tree-based models, did not outperform the traditional GLM and threshold-based pre-selection. Our results highlight the value of GLM in combination with penalised methods (particularly ridge) and threshold-based pre-selection when omitted variables are considered in the final interpretation.
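The shrinkage idea behind ridge regularisation can be sketched in a few lines of NumPy. This is an illustrative toy example (not code from the study): it simulates a strongly collinear predictor pair and applies the closed-form ridge estimate, which shrinks the unstable coefficients relative to ordinary least squares.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + 0.1 * rng.normal(size=n)  # strongly collinear with x1
X = np.column_stack([x1, x2])
y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)

def ridge(X, y, lam):
    """Closed-form ridge estimate: (X'X + lam*I)^-1 X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

beta_ols = ridge(X, y, 0.0)     # lam = 0 recovers ordinary least squares
beta_ridge = ridge(X, y, 10.0)  # the penalty pulls coefficients toward zero
```

The ridge solution's norm is non-increasing in the penalty `lam`, which is what stabilises estimates when X'X is near-singular under collinearity; the penalty strength here is arbitrary and would normally be chosen by cross-validation.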
However, all approaches tested yielded degraded predictions under change in collinearity structure, and the ‘folk lore’ threshold for correlation coefficients between predictor variables, |r| > 0.7, proved an appropriate indicator of when collinearity begins to severely distort model estimation and subsequent prediction. The use of ecological understanding of the system in pre-analysis variable selection and the choice of the least sensitive statistical approaches reduce the problems of collinearity, but cannot ultimately solve them.
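Threshold-based pre-selection with the |r| > 0.7 rule of thumb can be illustrated as a simple greedy filter over the correlation matrix. This is a minimal sketch, not the procedure used in the paper: the rule of keeping the earlier-listed predictor of each correlated pair is arbitrary, and in practice the choice should be guided by ecological knowledge.

```python
import numpy as np

def select_by_correlation(X, names, threshold=0.7):
    """Greedily keep predictors whose absolute pairwise correlation
    with every already-kept predictor stays at or below the threshold."""
    r = np.corrcoef(X, rowvar=False)
    keep = []
    for j in range(X.shape[1]):
        if all(abs(r[j, k]) <= threshold for k in keep):
            keep.append(j)
    return [names[j] for j in keep]

rng = np.random.default_rng(42)
n = 500
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.2 * rng.normal(size=n)  # strongly collinear with x1
x3 = rng.normal(size=n)                   # independent predictor
X = np.column_stack([x1, x2, x3])

print(select_by_correlation(X, ["x1", "x2", "x3"]))  # x2 is dropped
```

Note that pairwise filtering of this kind cannot detect collinearity involving three or more variables jointly; condition numbers or variance inflation factors are the usual diagnostics there.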
