A new definition for feature selection stability analysis
Abstract
Keywords
References
Ling, C.X., Huang, J., Zhang, H.: AUC: a better measure than accuracy in comparing learning algorithms. Adv. Artif. Intell. (2003)
Huang, J., Ling, C.X.: Using AUC and accuracy in evaluating learning algorithms. IEEE Trans. Knowl. Data Eng. 17(3), 299–310 (2005)
Al-Jarrah, O.Y., Yoo, P.D., Muhaidat, S., Karagiannidis, G.K., Taha, K.: Efficient machine learning for big data: a review. Big Data Res. 2(3), 87–93 (2015)
Jordan, M.I., Mitchell, T.M.: Machine learning: trends, perspectives, and prospects. Science 349(6245), 255–260 (2015)
Breiman, L.: Heuristics of instability and stabilization in model selection. Ann. Stat. 24(6), 2350–2383 (1996)
Bousquet, O., Elisseeff, A.: Stability and generalization. J. Mach. Learn. Res. 2, 499–526 (2002)
Rosenfeld, A., Richardson, A.: Explainability in human-agent systems. Auton. Agents Multi-Agent Syst. 33(6), 673–705 (2019)
Ben-Hur, A., Elisseeff, A., Guyon, I.: A stability based method for discovering structure in clustered data. Pac. Symp. Biocomput., 6–17 (2002)
Wang, J.: Consistent selection of the number of clusters via cross validation. Biometrika 97(4), 893–904 (2010)
Liu, H., Roeder, K., Wasserman, L.: Stability approach to regularization selection (StARS) for high dimensional graphical models. Adv. Neural Inf. Process. Syst. 23 (2010)
Shah, P., Kendall, F., Khozin, S., Goosen, R., Hu, J., Laramie, J., Ringel, M., Schork, N.: Artificial intelligence and machine learning in clinical development: a translational perspective. npj Digit. Med. 2, 69 (2019)
Boyko, N., Sviridova, T., Shakhovska, N.: Use of machine learning in the forecast of clinical consequences of cancer diseases. In: 7th Mediterranean Conference on Embedded Computing (MECO), pp. 1–6 (2018)
Yaniv-Rosenfeld, A., Savchenko, E., Rosenfeld, A., Lazebnik, T.: Scheduling BCG and IL-2 injections for bladder cancer immunotherapy treatment. Mathematics 11(5), 1192 (2023)
Veturi, Y.A., Woof, W., Lazebnik, T., Moghul, I., Woodward-Court, P., Wagner, S.K., Cabral de Guimaraes, T.A., Daich Varela, M., Liefers, B., Patel, P.J., Beck, S., Webster, A.R., Mahroo, O., Keane, P.A., Michaelides, M., Balaskas, K., Pontikos, N.: SynthEye: investigating the impact of synthetic data on artificial intelligence-assisted gene diagnosis of inherited retinal disease. Ophthalmol. Sci. 3(2), 100258 (2023)
Weng, S.F., Reps, J., Kai, J., Garibaldi, J.M., Qureshi, N.: Can machine-learning improve cardiovascular risk prediction using routine clinical data? PLOS ONE 12, e0174944 (2017)
Bonner, G.: Decision making for health care professionals: use of decision trees within the community mental health setting. J. Adv. Nurs. 35, 349–356 (2001)
Flechet, M., Güiza, F., Schetz, M., Wouters, P., Vanhorebeek, I., Derese, I., Gunst, J., Spriet, I., Casaer, M., Van den Berghe, G., Meyfroidt, G.: AKIpredictor, an online prognostic calculator for acute kidney injury in adult critically ill patients: development, validation and comparison to serum neutrophil gelatinase-associated lipocalin. Intensive Care Med. 43(6), 764–773 (2017)
Shung, D.L., Au, B., Taylor, R.A., Tay, J.K., Laursen, S.B., Stanley, A.J., Dalton, H.R., Ngu, J., Schultz, M., Laine, L.: Validation of a machine learning model that outperforms clinical risk scoring systems for upper gastrointestinal bleeding. Gastroenterology 158, 160–167 (2020)
Shamout, F., Zhu, T., Clifton, D.A.: Machine learning for clinical outcome prediction. IEEE Rev. Biomed. Eng. 14, 116–126 (2021)
Lazebnik, T., Somech, A., Weinberg, A.I.: SubStrat: a subset-based optimization strategy for faster AutoML. Proc. VLDB Endow. 16(4), 772–780 (2022)
Aztiria, A., Farhadi, G., Aghajan, H.: User Behavior Shift Detection in Intelligent Environments. Springer (2012)
Cavalcante, R.C., Oliveira, A.L.I.: An approach to handle concept drift in financial time series based on extreme learning machines and explicit drift detection. In: International Joint Conference on Neural Networks (IJCNN), pp. 1–8 (2015)
Lazebnik, T., Fleischer, T., Yaniv-Rosenfeld, A.: Benchmarking biologically-inspired automatic machine learning for economic tasks. Sustainability 15(14), 11232 (2023)
Shami, L., Lazebnik, T.: Implementing machine learning methods in estimating the size of the non-observed economy. Comput. Econ. (2023)
Chaudhuri, K., Vinterbo, S.A.: A stability-based validation procedure for differentially private machine learning. Adv. Neural Inf. Process. Syst. 26 (2013)
Yokoyama, H.: Machine learning system architectural pattern for improving operational stability. In: IEEE International Conference on Software Architecture Companion (ICSA-C) (2019)
Bolón-Canedo, V., Alonso-Betanzos, A.: Ensembles for feature selection: a review and future trends. Inf. Fusion 52, 1–12 (2019)
Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. J. Mach. Learn. Res. 3(Mar), 1157–1182 (2003)
Liu, H., Motoda, H., Setiono, R., Zhao, Z.: Feature selection: an ever evolving frontier in data mining. In: Feature Selection in Data Mining, pp. 4–13. PMLR (2010)
Rosenfeld, A.: Better metrics for evaluating explainable artificial intelligence. In: AAMAS ’21: 20th International Conference on Autonomous Agents and Multiagent Systems, pp. 45–50. ACM (2021)
Bhatt, U., Xiang, A., Sharma, S., Weller, A., Taly, A., Jia, Y., Ghosh, J., Puri, R., Moura, J.M.F., Eckersley, P.: Explainable machine learning in deployment. In: Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pp. 648–657 (2020)
Lazebnik, T., Bunimovich-Mendrazitsky, S., Rosenfeld, A.: An algorithm to optimize explainability using feature ensembles. Appl. Intell. (2024)
Sun, W.: Stability of machine learning algorithms. PhD thesis, Purdue University (2015)
Stanley, K.O.: Learning concept drift with a committee of decision trees. Technical Report AI03-302, Department of Computer Sciences, University of Texas at Austin (2003)
Jain, A.K., Chandrasekaran, B.: Machine learning based concept drift detection for predictive maintenance. Comput. Ind. Eng. 137, 106031 (2019)
Khaire, U.M., Dhanalakshmi, R.: Stability of feature selection algorithm: a review. J. King Saud Univ. Comput. Inf. Sci. (2019)
Shah, R., Samworth, R.: Variable selection with error control: another look at stability selection. J. R. Stat. Soc. Ser. B 75(1), 55–80 (2013)
Sun, W., Wang, J., Fang, Y.: Consistent selection of tuning parameters via variable selection stability. J. Mach. Learn. Res. 14, 3419–3440 (2013)
Han, Y.: Stable feature selection: theory and algorithms. PhD thesis (2012)
Zhang, X., Fan, M., Wang, D., Zhou, P., Tao, D.: Top-k feature selection framework using robust 0-1 integer programming. IEEE Trans. Neural Netw. Learn. Syst. 32(7), 3005–3019 (2021)
Plackett, R.L.: Karl Pearson and the chi-squared test. Int. Stat. Rev. 51(1), 59–72 (1983)
Chung, N.C., Miasojedow, B., Startek, M., Gambin, A.: Jaccard/Tanimoto similarity test and estimation methods for biological presence-absence data. BMC Bioinform. 20 (2019)
Bajusz, D., Rácz, A., Héberger, K.: Why is Tanimoto index an appropriate choice for fingerprint-based similarity calculations? J. Cheminform. 7, 20 (2015)
Bookstein, A., Kulyukin, V.A., Raita, T.: Generalized hamming distance. Inf. Retr. 5, 353–375 (2002)
Liu, Y., Mu, Y., Chen, K., Li, Y., Guo, J.: Daily activity feature selection in smart homes based on Pearson correlation coefficient. Neural Process. Lett. 51, 1771–1787 (2020)
Chandrashekar, G., Sahin, F.: A survey on feature selection methods. Comput. Electr. Eng. 40(1), 16–28 (2014)
Kannan, S.S., Ramaraj, N.: A novel hybrid feature selection via symmetrical uncertainty ranking based local memetic search algorithm. Knowl.-Based Syst. 23(6), 580–585 (2010)
Chengzhang, L., Jiucheng, X.: Feature selection with the Fisher score followed by the maximal clique centrality algorithm can accurately identify the hub genes of hepatocellular carcinoma. Sci. Rep. 9, 17283 (2019)
Gu, Q., Li, Z., Han, J.: Generalized fisher score for feature selection. In: Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, pp. 266–273. AUAI Press (2011)
Azhagusundari, B., Thanamani, A.S.: Feature selection based on information gain. Int. J. Innov. Res. Sci. Eng. Technol. 2(2), 18–21 (2013)
Bommert, A., Lang, M.: stabm: stability measures for feature selection. J. Open Source Softw. 6(59), 3010 (2021)
Kalousis, A., Prados, J., Hilario, M.: Stability of feature selection algorithms: a study on high-dimensional spaces. Knowl. Inf. Syst. 12(1), 95–116 (2007)
Kuncheva, L.I.: A stability index for feature selection. In: Proceedings of the 25th IASTED International Multi-Conference: Artificial Intelligence and Applications (2007)
Dernoncourt, D., Hanczar, B., Zucker, J.-D.: Analysis of feature selection stability on high dimension and small sample data. Comput. Stat. Data Anal. 71, 681–693 (2013)
Saeys, Y., Abeel, T., Van de Peer, Y.: Robust feature selection using ensemble feature selection techniques. In: Machine Learning and Knowledge Discovery in Databases, pp. 313–325. Springer (2008)
Yeom, S., Giacomelli, I., Fredrikson, M., Jha, S.: Privacy risk in machine learning: analyzing the connection to overfitting. In: 2018 IEEE 31st Computer Security Foundations Symposium (CSF), pp. 268–282. IEEE (2018)
Nogueira, S., Sechidis, K., Brown, G.: On the stability of feature selection algorithms. J. Mach. Learn. Res. 18, 1–54 (2018)
Lyapunov, A.M.: The general problem of the stability of motion. University of Kharkov (1966)
Shami, L., Lazebnik, T.: Economic aspects of the detection of new strains in a multi-strain epidemiological-mathematical model. Chaos, Solitons & Fractals 165, 112823 (2022)
Mayerhofer, T., Klein, S.J., Peer, A., Perschinka, F., Lehner, G.F., Hasslacher, J., Bellmann, R., Gasteiger, L., Mittermayr, S., Eschertzhuber, M., Mathis, S., Fiala, S., Fries, D., Kalenka, A., Foidl, E., Hasibeder, W., Helbok, R., Kirchmair, L., Stogermüller, C., Krismer, B., Heiner, T., Ladner, E., Thome, C., Preub-Hernandez, C., Mayr, A., Pechlaner, A., Potocnik, M., Reitter, M., Brunner, J., Zagitzer-Hofer, S., Ribitsch, A., Joannidis, M.: Changes in characteristics and outcomes of critically ill COVID-19 patients in Tyrol (Austria) over 1 year. Wiener klinische Wochenschrift 133, 1237–1247 (2021)
Jović, A., Brkić, K., Bogunović, N.: A review of feature selection methods with applications. In: 38th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), pp. 1200–1205. IEEE (2015)
Liu, R., Liu, E., Yang, J., Li, M., Wang, F.: Optimizing the hyper-parameters for SVM by combining evolution strategies with a grid search. In: Intelligent Control and Automation. Lecture Notes in Control and Information Sciences, vol. 344. Springer (2006)
Rezatofighi, H., Tsoi, N., Gwak, J., Sadeghian, A., Reid, I., Savarese, S.: Generalized intersection over union: a metric and a loss for bounding box regression. In: CVPR (2019)
Žliobaitė, I., Pechenizkiy, M., Gama, J.: Big Data Analysis: New Algorithms for a New Society, vol. 16. Springer (2016)
Gama, J., Žliobaitė, I., Bifet, A., Pechenizkiy, M., Bouchachia, A.: A survey on concept drift adaptation. ACM Comput. Surv. 46(4), 1–37 (2014)
Lu, J., Liu, A., Dong, F., Gu, F., Gama, J., Zhang, G.: Learning under concept drift: a review. IEEE Trans. Knowl. Data Eng. 31(12), 2346–2363 (2019)
Marlin, B.M.: Missing data problems in machine learning. PhD thesis, University of Toronto (2008)
Jerez, J.M., Molina, I., Garcia-Laencina, P.J., Alba, E., Ribelles, N., Martin, M., Franco, L.: Missing data imputation using statistical and machine learning methods in a real breast cancer problem. Artif. Intell. Med. 50(2), 105–115 (2010)