K-Means Featurizer: A booster for intricate datasets
Springer Science and Business Media LLC - Pages 1-26 - 2024
Abstract
Machine Learning (ML) has become pivotal across various fields, offering innovative solutions to complex data challenges. Practitioners typically seek models that combine strong performance with reliability, aiming for optimal generalization on future data. To that end, a variety of methods such as dummy coding, up/down-sampling, and bin-counting have been explored. However, finding a solution that effectively handles limited and complex datasets remains a challenge. This study introduces the K-Means Featurizer (KMF), an algorithm designed to enhance model performance and reliability, especially in scenarios involving complex and limited datasets. KMF employs K-Means clustering to generate enriched features that provide a nuanced understanding of the data, effectively balancing the similarity between the target variable and the feature space. This makes the predictive task more efficient by minimizing Euclidean distances and improves model generalizability. Our research validates KMF's effectiveness through an experiment in geoscience engineering, focusing on the prediction of hydraulic conductivity (K), a vital parameter in well monitoring and infrastructure planning. Traditionally, obtaining K is laborious and costly, requiring extensive pumping tests. KMF's application in this context demonstrates its potential to substantially reduce data losses during such operations. Applying KMF to Extreme Gradient Boosting, Random Forest, K-Nearest Neighbors, Support Vector Machines, and multilayer neural networks resulted in a significant improvement in prediction accuracy, with K-scores reaching up to 90%. While our experiment centers on geoscience engineering, KMF's utility extends to other domains facing similar data intricacies. Its adaptability to different types of complex datasets positions it as a valuable tool for diverse data-driven applications.
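The abstract does not spell out KMF's exact procedure, so the following is only a minimal sketch of one common way to build target-aware K-Means features with scikit-learn, not the authors' implementation. The class name `KMeansFeaturizerSketch`, the `target_scale` parameter, and the choice to append the cluster index as one extra column are all assumptions made for illustration. The idea is to cluster the features together with a scaled copy of the target, then refine the resulting centroids in feature space alone so that new samples can be assigned without knowing their target.

```python
import numpy as np
from sklearn.cluster import KMeans


class KMeansFeaturizerSketch:
    """Illustrative target-aware K-Means featurizer (hypothetical, not the paper's code)."""

    def __init__(self, n_clusters=5, target_scale=1.0, random_state=0):
        self.n_clusters = n_clusters
        self.target_scale = target_scale  # assumed knob: weight of the target during clustering
        self.random_state = random_state

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        if y is None:
            # No target available: plain K-Means in feature space.
            km = KMeans(n_clusters=self.n_clusters, n_init=10,
                        random_state=self.random_state).fit(X)
            self.centers_ = km.cluster_centers_
            return self

        # Step 1: cluster features together with the scaled target.
        y = np.asarray(y, dtype=float).reshape(-1, 1)
        Xy = np.hstack([X, self.target_scale * y])
        km_hint = KMeans(n_clusters=self.n_clusters, n_init=10,
                         random_state=self.random_state).fit(Xy)

        # Step 2: drop the target coordinate of each centroid and run one
        # K-Means iteration on the features only, starting from those centroids.
        init = km_hint.cluster_centers_[:, :-1]
        km = KMeans(n_clusters=self.n_clusters, init=init,
                    n_init=1, max_iter=1).fit(X)
        self.centers_ = km.cluster_centers_
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        # Euclidean distance from every sample to every centroid.
        dists = np.linalg.norm(X[:, None, :] - self.centers_[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Enriched matrix: original features plus the nearest-cluster index.
        return np.hstack([X, labels[:, None].astype(float)])
```

The enriched matrix returned by `transform` could then be passed to any downstream estimator, such as the boosting, forest, or support vector models named above. Whether the cluster index should be one-hot encoded first depends on the estimator and is left open here.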