Robust regression via error tolerance

Data Mining and Knowledge Discovery, Volume 36, pages 781–810, 2022
Anton Björklund, Andreas Henelius (1,2), Emilia Oikarinen (1), Kimmo Kallonen (3), Kai Puolamäki (1,4)
1Department of Computer Science, University of Helsinki, Helsinki, Finland
2OP Financial Group, Helsinki, Finland
3Helsinki Institute of Physics, University of Helsinki, Helsinki, Finland
4Institute for Atmospheric and Earth System Research (INAR), University of Helsinki, Helsinki, Finland

Abstract

Real-world datasets are often characterised by outliers: data items that do not follow the same structure as the rest of the data. These outliers can negatively influence modelling of the data, so it is important in data analysis to use methods that are robust to outliers. In this paper we develop a robust regression method that finds the largest subset of data items that can be approximated by a sparse linear model to a given precision. We show that this can yield the best possible robustness to outliers. However, the problem is NP-hard, and to solve it we present an efficient approximation algorithm, termed SLISE. Our method extends existing state-of-the-art robust regression methods, especially in terms of speed on high-dimensional datasets. We demonstrate the method by applying it to both synthetic and real-world regression problems.
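The objective stated in the abstract, finding the largest subset of data items that a linear model approximates within a given error tolerance, can be illustrated with a minimal random-sampling sketch on synthetic data. This is not the SLISE algorithm itself (SLISE is an approximation algorithm described in the paper); the RANSAC-style loop, the data, the tolerance `eps`, and the iteration count below are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D data: 80 inliers near y = 2x + 1, plus 20 gross outliers.
n_in, n_out = 80, 20
x = rng.uniform(-1, 1, n_in + n_out)
y = 2 * x + 1 + rng.normal(0, 0.05, n_in + n_out)
y[n_in:] += rng.uniform(3, 6, n_out)          # corrupt the last 20 items

X = np.column_stack([x, np.ones_like(x)])     # design matrix with intercept
eps = 0.2                                     # error tolerance (assumed value)

# RANSAC-style sketch: repeatedly fit a least-squares model to a random
# minimal subset, and keep the model whose eps-tube covers the most items.
best_beta, best_count = None, -1
for _ in range(200):
    idx = rng.choice(len(y), size=2, replace=False)
    beta, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    covered = np.abs(y - X @ beta) <= eps     # items approximated within eps
    if covered.sum() > best_count:
        best_beta, best_count = beta, int(covered.sum())

print(best_count)
```

Maximising the size of the covered subset, rather than minimising an average loss over all items, is what makes the criterion insensitive to arbitrarily bad outliers: items outside the tolerance simply do not count, however far away they are.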
