Understanding the role of training sample size in the uncertainty of high-resolution LULC mapping using random forest
Tóm tắt
High-resolution sensors onboard satellites are generally reputed for rapidly producing land-use/land-cover (LULC) maps with improved spatial detail. However, such maps are subject to uncertainties due to several factors, including the training sample size. We investigated the effects of different training sample sizes (from 1000 to 12,000 pixels) on LULC classification accuracy using the random forest (RF) classifier. Then, we analyzed classification uncertainties by determining the median and the interquartile range (IQR) of the overall accuracy (OA) values through repeated k-fold cross-validation. Results showed that increasing training pixels significantly improved OA while minimizing model uncertainty. Specifically, larger training samples, ranging from 9000 to 12,000 pixels, exhibited narrower IQRs than smaller samples (1000–2000 pixels). Furthermore, there was a significant variation (Chi2 = 85.073; df = 11; p < 0.001) and a significant trend (J-T = 4641, p < 0.001) in OA values across various training sample sizes. Although larger training samples generally yielded high accuracies, this trend was not always consistent, as the lowest accuracy did not necessarily correspond to the smallest training sample. Nevertheless, models using 9000–11,000 pixels were effective (OA > 96%) and provided an accurate visual representation of LULC. Our findings emphasize the importance of selecting an appropriate training sample size to reduce uncertainties in high-resolution LULC classification.
Tài liệu tham khảo
Abriha D, Srivastava PK, Szabó S (2023) Smaller is better? Unduly nice accuracy assessments in roof detection using remote sensing data with machine learning and k-fold cross-validation. Heliyon 9:1–17. https://doi.org/10.1016/j.heliyon.2023.e14045
Anderson JR, Hardy EE, Roach JT, Witmer RE (1976) A land use and land cover classification system for use with remote sensor data. US Geol Surv Prof Paper 964:28
Aune-Lundberg L, Strand G-H (2014) Environ Model Softw 61:87–97. https://doi.org/10.1016/j.envsoft.2014.07.001. Comparison of variance estimation methods for use with two-dimensional systematic sampling of land use/land cover data
Belgiu M, Drăgu L (2016) Random forest in remote sensing: a review of applications and future directions. ISPRS J Photogrammetry Remote Sens 114:24–31. https://doi.org/10.1016/j.isprsjprs.2016.01.011
Bobalova H, Benová A, Kožuch M (2021) Hierarchical object-based mapping of Urban Land Cover using Sentinel-2 data: a case study of six cities in Central Europe. PFG–Journal of Photogrammetry Remote Sensing and Geoinformation Science 89:15–31. https://doi.org/10.1007/s41064-020-00135-8
Breiman L (2001) Random forests. Mach Learn 45:5–32. https://doi.org/10.1023/A:1010933404324
Bui DH, Mucsi L (2022) Predicting the future land-use change and evaluating the change in landscape pattern in Binh Duong province, Vietnam. Hung Geographical Bull 71:349–364. https://doi.org/10.15201/hungeobull.71.4.3
Burai P, Deák B, Valkó O, Tomor T (2015) Classification of herbaceous vegetation using airborne hyperspectral imagery. Remote Sens 7:2046–2066. https://doi.org/10.3390/rs70202046
Chatziantoniou A, Petropoulos GP, Psomiadis E (2017) Co-Orbital Sentinel 1 and 2 for LULC mapping with emphasis on wetlands in a mediterranean setting based on machine learning. Remote Sens 9:1259. https://doi.org/10.3390/rs9121259
Cheng KS, Ling JY, Lin TW et al (2021) Quantifying uncertainty in Land-Use/Land-Cover classification accuracy: a Stochastic Simulation Approach. Front Environ Sci 9:1–18. https://doi.org/10.3389/fenvs.2021.628214
Congalton RG (1991) A review of assessing the accuracy of classifications of remotely sensed data. Remote Sens Environ 37:35–46. https://doi.org/10.1016/0034-4257(91)90048-B
Cutler DR, Edwards TC Jr, Beard KH et al (2007) Random forests for classification in ecology. Ecology 88:2783–2792. https://doi.org/10.1890/07-0539.1
Ebrahimy H, Mirbagheri B, Matkan AA, Azadbakht M (2021) Per-pixel land cover accuracy prediction: a random forest-based method with limited reference sample data. ISPRS J Photogrammetry Remote Sens 172:17–27. https://doi.org/10.1016/j.isprsjprs.2020.11.024
ESRI (2022) ArcGIS Desktop Software (Version 10.4)
Everitt JH, Yang C, Fletcher R, Deloach CJ (2008) Comparison of QuickBird and SPOT 5 satellite imagery for mapping giant reed. J Aquat Plant Manag 46:77–82
Foody GM, Mathur A, Sanchez-Hernandez C, Boyd DS (2006) Training set size requirements for the classification of a specific class. Remote Sens Environ 104:1–14. https://doi.org/10.1016/j.rse.2006.03.004
Gascon F, Ramoino F (2017) Sentinel-2 data exploitation with ESA’s Sentinel-2 Toolbox. In: EGU General Assembly Conference Abstracts. p 19548
Gudmann A, Mucsi L (2022) Pixel and object-based Land Cover Mapping and Change Detection from 1986 to 2020 for Hungary using Histogram-based gradient boosting classification Tree Classifier. Geogr Pannonica 26:165–175. https://doi.org/10.5937/gp26-37720
Heydari SS, Mountrakis G (2018) Effect of classifier selection, reference sample size, reference class distribution and scene heterogeneity in per-pixel classification accuracy using 26 landsat sites. Remote Sens Environ 204:648–658. https://doi.org/10.1016/j.rse.2017.09.035
Higgs C, van Niekerk A (2022) Impact of Training Set Configurations for differentiating Plantation Forest Genera with Sentinel-2 Imagery and Machine Learning. Remote Sens 14:3992. https://doi.org/10.3390/rs14163992
Huang C, Asner GP (2009) Applications of remote sensing to alien invasive plant studies. Sensors 9:4869–4889. https://doi.org/10.3390/s90604869
Jensen JR, Cowen DC (1999) Remote sensing of urban/suburban infrastructure and socio-economic attributes. Photogramm Eng Remote Sensing 65:611–622
Jia Y, Ge Y, Ling F et al (2018) Urban land use mapping by combining remote sensing imagery and mobile phone positioning data. Remote Sens 10:446. https://doi.org/10.3390/rs10030446
Jonckheere AR (1954) A distribution-free k-sample test against ordered alternatives. Biometrika 41:133–145. https://doi.org/10.2307/2333011
Khatami R, Mountrakis G, Stehman SV (2016) A meta-analysis of remote sensing research on supervised pixel-based land-cover image classification processes: General guidelines for practitioners and future research. Remote Sens Environ 177:89–100. https://doi.org/10.1016/j.rse.2016.02.028
Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet Classification with Deep Convolutional Neural Networks. In: Advances in Neural Information Processing Systems 25. pp 1–9
Kruskal WH, Wallis WA (1952) Use of ranks in one-criterion variance analysis. J Am Stat Assoc 47:583–621. https://doi.org/10.1080/01621459.1952.10483441
Kuhn M, Wing S, Weston A, Williams C et al (2023) Caret: classification and regression training. R Package Version 6:0–94. https://github.com/topepo/caret/
Landis JR, Koch GG (1977) The measurement of observer agreement for categorical data. Biometrics 33:159–174. https://doi.org/10.2307/2529310
Liaw A, Wiener M (2002) Classification and regression by randomForest. R news 2:18–22
Luo X, Tong X, Hu Z, Wu G (2020) Improving urban land cover/use mapping by integrating a hybrid convolutional neural network and an automatic training sample expanding strategy. Remote Sens 12:2292. https://doi.org/10.3390/rs12142292
Ma L, Li M, Ma X et al (2017) A review of supervised object-based land-cover image classification. ISPRS J Photogrammetry Remote Sens 130:277–293. https://doi.org/10.1016/j.isprsjprs.2017.06.001
Matcı DK, Avdan U (2022) Data-driven automatic labelling of land cover classes from remotely sensed images. Earth Sci Inform 15:1059–1071. https://doi.org/10.1007/s12145-022-00788-6
Maxwell AE, Strager MP, Warner TA et al (2019) Large-Area, high spatial Resolution Land Cover Mapping using Random forests, GEOBIA, and NAIP Orthophotography: findings and recommendations. Remote Sens 11:1409. https://doi.org/10.3390/rs11121409
Mazeka B, Phinzi K, Sutherland C (2021) Monitoring changing Land Use-Land Cover Change to reflect the impact of Urbanisation on Environmental Assets in Durban, South Africa. Sustainable Urban futures in Africa. Routledge, pp 132–158. https://doi.org/10.4324/9781003181484-7
Millard K, Richardson M (2015) On the importance of training data sample selection in random forest image classification: a case study in peatland ecosystem mapping. Remote Sens 7:8489–8515. https://doi.org/10.3390/rs70708489
Myburgh G, Van Niekerk A (2013) Effect of feature dimensionality on object-based land cover classification: a comparison of three classifiers. South Afr J Geomatics 2:13–27
Nagel P, Yuan F (2016) High-resolution land cover and impervious surface classifications in the twin cities metropolitan area with NAIP imagery. Photogramm Eng Remote Sensing 82:63–71. https://doi.org/10.14358/PERS.83.1.63
Padmanaban R, Bhowmik AK, Cabral P (2019) Satellite image fusion to detect changing surface permeability and emerging urban heat islands in a fast-growing city. PLoS ONE 14:1–20. https://doi.org/10.1371/journal.pone.0208949
Pawłuszek K, Marczak S, Borkowski A, Tarolli P (2019) Multi-aspect analysis of object-oriented landslide detection based on an extended set of LiDAR-derived terrain features. ISPRS Int J Geoinf 8:321. https://doi.org/10.3390/ijgi8080321
Podsiadlo I, Paris C, Bruzzone L (2021) An approach based on low resolution land-cover-maps and domain adaptation to define representative training sets at large scale. In: International Geoscience and Remote Sensing Symposium (IGARSS). Institute of Electrical and Electronics Engineers Inc., pp 313–316. https://doi.org/10.1109/IGARSS47720.2021.9553498
Qian Y, Zhou W, Yan J et al (2015) Comparing machine learning classifiers for object-based land cover classification using very high resolution imagery. Remote Sens 7:153–168. https://doi.org/10.3390/rs70100153
R Core Team (2021) R: a language and environment for statistical computing. R Foundation for statistical computing, Vienna
Ramezan CA, Warner TA, Maxwell AE, Price BS (2021) Effects of training set size on supervised machine-learning land-cover classification of large-area high-resolution remotely sensed data. Remote Sens 13:368. https://doi.org/10.3390/rs13030368
Shang M, Wang S-X, Zhou Y, Du C (2018) Effects of Training samples and classifiers on classification of Landsat-8 imagery. J Indian Soc Remote Sens 46:1333–1340. https://doi.org/10.1007/s12524-018-0777-z
Shao Y, Cooner AJ, Walsh SJ (2021) Assessing deep convolutional neural networks and assisted machine perception for urban mapping. Remote Sens 13:1523. https://doi.org/10.3390/rs13081523
Statistics South Africa (2011) “Greater Kokstad Municipality”. https://www.statssa.gov.za/?page_id=993&id=greater-kokstad-municipality. Accessed on 22 August 2023
Talukdar S, Singha P, Mahato S et al (2020) Land-use land-cover classification by machine learning classifiers for satellite observations—a review. Remote Sens 12:1135. https://doi.org/10.3390/rs12071135
Terpstra TJ (1952) The asymptotic normality and consistency of Kendall’s test against trend, when ties are present in one ranking. Indagationes Math 14:327–333
Thanh NP, Kappas M (2017) Comparison of Random Forest, k-Nearest neighbor, and support Vector Machine Classifiers for Land Cover classification using Sentinel-2 imagery. Sensors 18:18. https://doi.org/10.3390/s18010018
Therneau T, Atkinson B, Ripley B (2022) rpart: Recursive partitioning and regression trees. R package version 4.1.19. https://cran.r-project.org/package=rpart
Topaloğlu RH, Sertel E, Musaoğlu N (2016) Int archives photogrammetry remote Sens Spat Inform Sci 41:12–49. https://doi.org/10.5194/isprsarchives-XLI-B8-1055-2016. assessment of classification accuracies of Sentinel-2 and landsat-8 data for land cover/use mapping
Ustuner M, Sanli FB, Abdikan S (2016) Balanced vs imbalanced training data: classifying RapidEye data with support vector machines. Int Archives Photogrammetry Remote Sens Spat Inform Sci 41:379–384. https://doi.org/10.5194/isprs-archives-XLI-B7-379-2016
Van Niel TG, McVicar TR, Datt B (2005) On the relationship between training sample size and data dimensionality: Monte Carlo analysis of broadband multi-temporal classification. Remote Sens Environ 98:468–480. https://doi.org/10.1016/j.rse.2005.08.011