Correcting for bias in distribution modelling for rare species using citizen science data
Abstract
We aim to improve the accuracy of inferences on habitat associations and distribution patterns of rare species by combining machine learning, spatial filtering and resampling to address the class imbalance and spatial bias found in large volumes of citizen science data.
Modelling rare species’ distributions is a pressing challenge for conservation and applied research. Often, a large number of surveys are required before enough detections occur to model distributions of rare species accurately, resulting in a data set with a high proportion of non‐detections (i.e. class imbalance). Citizen science data can provide a cost‐effective source of surveys but likely suffer from class imbalance. Citizen science data also suffer from spatial bias, likely from preferential sampling. To correct for class imbalance and spatial bias, we used spatial filtering to under‐sample the majority class (non‐detection) while maintaining all of the limited information from the minority class (detection). We investigated the use of spatial under‐sampling with randomForest models and compared it to common approaches used for imbalanced data, the synthetic minority oversampling technique (
Spatial under‐sampling increased the accuracy of each model and outperformed the approach typically used to direct under‐sampling in the
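The spatial under‐sampling idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a regular square grid and retains one randomly chosen non‐detection per occupied cell while keeping every detection. The function name `spatial_undersample`, the `cell_size` parameter, the record layout `(x, y, detected)` and the example coordinates are all hypothetical.

```python
import math
import random
from collections import defaultdict


def spatial_undersample(records, cell_size=1.0, seed=0):
    """Thin the majority class (non-detections) on a regular grid.

    records   : list of (x, y, detected) tuples (hypothetical layout)
    cell_size : grid cell width, in the same units as x and y (assumed)
    seed      : seed for the random choice within each cell
    """
    rng = random.Random(seed)

    # Keep all of the limited minority-class information (detections).
    detections = [r for r in records if r[2]]

    # Group non-detections by the grid cell they fall in.
    cells = defaultdict(list)
    for x, y, detected in records:
        if not detected:
            key = (math.floor(x / cell_size), math.floor(y / cell_size))
            cells[key].append((x, y, detected))

    # Retain one randomly chosen non-detection per occupied cell
    # (an assumption of this sketch, not a rule stated in the abstract).
    kept = [rng.choice(cells[key]) for key in sorted(cells)]

    return detections + kept


# Illustrative data: three clustered non-detections in one cell,
# one isolated non-detection, and two detections.
records = [
    (0.1, 0.1, False),
    (0.2, 0.3, False),
    (0.4, 0.9, False),
    (1.5, 0.5, False),
    (1.6, 0.4, True),
    (2.2, 2.2, True),
]
balanced = spatial_undersample(records, cell_size=1.0, seed=1)
print(len(balanced))  # 2 detections + 1 non-detection from each of 2 cells
```

Because heavily surveyed cells contribute no more non‐detections than sparsely surveyed ones, this kind of thinning reduces both the class imbalance and the spatial clustering of surveys in one step. The thinned records could then be passed to any classifier, such as a random forest.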