Sample selection bias and presence‐only distribution models: implications for background and pseudo‐absence data

Ecological Applications - Volume 19, Issue 1 - Pages 181-197 - 2009
Steven J. Phillips¹, Miroslav Dudík², Jane Elith³, Catherine H. Graham⁴, Anthony Lehmann⁵, John R. Leathwick⁶, Simon Ferrier⁷
¹ AT&T Labs–Research, 180 Park Avenue, Florham Park, New Jersey 07932 USA
² Computer Science Department, Princeton University, 35 Olden Street, Princeton, New Jersey 08544 USA
³ School of Botany, University of Melbourne, Parkville, Victoria 3010, Australia
⁴ Department of Ecology and Evolution, 650 Life Sciences Building, Stony Brook University, New York 11794 USA
⁵ Climatic Change and Climate Impacts, University of Geneva, 7 Route de Drize, 1227 Carouge, Switzerland
⁶ NIWA, Hamilton, New Zealand
⁷ New South Wales Department of Environment and Climate Change, P.O. Box 402, Armidale 2350 Australia

Abstract

Most methods for modeling species distributions from occurrence records require additional data representing the range of environmental conditions in the modeled region. These data, called background or pseudo‐absence data, are usually drawn at random from the entire region, whereas occurrence collection is often spatially biased toward easily accessed areas. Since the spatial bias generally results in environmental bias, the difference between occurrence collection and background sampling may lead to inaccurate models. To correct the estimation, we propose choosing background data with the same bias as occurrence data. We investigate theoretical and practical implications of this approach. Accurate information about spatial bias is usually lacking, so explicit biased sampling of background sites may not be possible. However, it is likely that an entire target group of species observed by similar methods will share similar bias. We therefore explore the use of all occurrences within a target group as biased background data. We compare model performance using target‐group background and randomly sampled background on a comprehensive collection of data for 226 species from diverse regions of the world. We find that target‐group background improves average performance for all the modeling methods we consider, with the choice of background data having as large an effect on predictive performance as the choice of modeling method. The performance improvement due to target‐group background is greatest when there is strong bias in the target‐group presence records. Our approach applies to regression‐based modeling methods that have been adapted for use with occurrence data, such as generalized linear or additive models and boosted regression trees, and to Maxent, a probability density estimation method. 
We argue that increased awareness of the implications of spatial bias in surveys, and possible modeling remedies, will substantially improve predictions of species distributions.
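
The contrast between the two kinds of background data described above can be illustrated with a minimal sketch. It is not the authors' implementation: the grid, the species names, the function names `random_background` and `target_group_background`, and the choice to de-duplicate pooled sites are all assumptions made for illustration. Random background is drawn uniformly from the whole region, while target-group background pools the occurrence sites of every species in the target group, so it inherits whatever survey bias those records share.

```python
import random


def random_background(region_cells, n, seed=0):
    """Draw n background cells uniformly at random from the whole region."""
    rng = random.Random(seed)
    return rng.sample(list(region_cells), n)


def target_group_background(occurrences_by_species):
    """Pool the occurrence sites of every species in the target group.

    Sites are de-duplicated here (an assumption of this sketch), so a cell
    visited for several species counts once.
    """
    sites = set()
    for cells in occurrences_by_species.values():
        sites.update(cells)
    return sorted(sites)


# Hypothetical 10 x 10 region, indexed as (row, column).
region = [(r, c) for r in range(10) for c in range(10)]

# Hypothetical records: all surveys happened near a road along columns 0-1.
occurrences = {
    "species_a": [(0, 0), (1, 0), (2, 1)],
    "species_b": [(1, 0), (3, 1), (4, 0)],
    "species_c": [(2, 1), (5, 0)],
}

tg_background = target_group_background(occurrences)
rnd_background = random_background(region, len(tg_background))

print("target-group background:", tg_background)
print("random background:      ", rnd_background)
```

In this toy setup every target-group background cell lies in columns 0 and 1, matching the bias of the occurrence records, whereas the random background is spread over the whole grid. Either list could then be passed as the background sample to a presence-only model.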
