Correcting for bias in distribution modelling for rare species using citizen science data

Diversity and Distributions - Volume 24, Issue 4 - Pages 460-472 - 2018
Orin J. Robinson, Viviana Ruiz‐Gutiérrez, Daniel Fink
Cornell Laboratory of Ornithology, Ithaca, NY, USA

Abstract

Aim

To improve the accuracy of inferences on habitat associations and distribution patterns of rare species by combining machine‐learning, spatial filtering and resampling to address class imbalance and spatial bias of large volumes of citizen science data.

Innovation

Modelling rare species’ distributions is a pressing challenge for conservation and applied research. Often, a large number of surveys are required before enough detections occur to model distributions of rare species accurately, resulting in a data set with a high proportion of non‐detections (i.e. class imbalance). Citizen science data can provide a cost‐effective source of surveys but likely suffer from class imbalance. Citizen science data also suffer from spatial bias, likely from preferential sampling. To correct for class imbalance and spatial bias, we used spatial filtering to under‐sample the majority class (non‐detection) while maintaining all of the limited information from the minority class (detection). We investigated the use of spatial under‐sampling with randomForest models and compared it to common approaches used for imbalanced data: the synthetic minority oversampling technique (SMOTE), weighted random forest and balanced random forest models. Model accuracy was assessed using kappa, Brier score and AUC. We demonstrate the method by evaluating habitat associations and seasonal distribution patterns using citizen science data for a rare species, the tricoloured blackbird (Agelaius tricolor).

Main Conclusions

Spatial under‐sampling increased the accuracy of each model and outperformed the approach typically used to direct under‐sampling in the SMOTE algorithm. Our approach is the first to characterize winter distribution and movement of tricoloured blackbirds. Our results show that tricoloured blackbirds are positively associated with grassland, pasture and wetland habitats, and negatively associated with high elevations or evergreen forests during both winter and breeding seasons. The seasonal differences in distribution indicate that individuals move to the coast during the winter, as suggested by historical accounts.
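
The spatial under‐sampling idea described in the abstract (thin the non‐detections in space, keep every detection) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `spatial_undersample`, the regular longitude/latitude grid, the `cell_size` value and the `per_cell` limit are all assumptions made for illustration, since the abstract does not state how the spatial filter was defined.

```python
import math
import random
from collections import defaultdict


def spatial_undersample(records, cell_size, per_cell=1, seed=0):
    """Thin non-detections on a regular grid while keeping all detections.

    records   : list of (x, y, detected) tuples; x and y are coordinates
                (hypothetical layout, e.g. longitude and latitude).
    cell_size : grid cell width in coordinate units (assumed value).
    per_cell  : maximum non-detections retained per grid cell (assumed value).
    """
    rng = random.Random(seed)

    # Minority class: every detection is retained unchanged.
    detections = [r for r in records if r[2]]

    # Majority class: group non-detections by the grid cell they fall in.
    cells = defaultdict(list)
    for x, y, detected in records:
        if not detected:
            key = (math.floor(x / cell_size), math.floor(y / cell_size))
            cells[key].append((x, y, detected))

    # Keep at most `per_cell` randomly chosen non-detections from each cell,
    # so heavily surveyed areas no longer dominate the training data.
    kept = []
    for key in sorted(cells):
        group = cells[key]
        kept.extend(rng.sample(group, min(per_cell, len(group))))

    return detections + kept
```

The thinned records would then be passed to a classifier such as a random forest. Choosing `cell_size` trades off how strongly spatial clustering is reduced against how many non‐detections remain for training.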
