Robust regression via error tolerance
Abstract
Real-world datasets are often characterised by outliers: data items that do not follow the same structure as the rest of the data. These outliers might negatively influence the modelling of the data. In data analysis it is therefore important to consider methods that are robust to outliers. In this paper we develop a robust regression method that finds the largest subset of data items that can be approximated using a sparse linear model to a given precision. We show that this can yield the best possible robustness to outliers. However, this problem is NP-hard, and to solve it we present an efficient approximation algorithm, termed SLISE. Our method extends existing state-of-the-art robust regression methods, especially in terms of speed on high-dimensional datasets. We demonstrate our method by applying it to both synthetic and real-world regression problems.
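As a rough illustration of the problem statement above (not the SLISE algorithm itself), the sketch below counts how many data items a candidate linear model fits to within a given error tolerance, and searches for a large such subset by repeatedly refitting on random subsamples, RANSAC-style. The tolerance `eps`, the subsample size, and the least-squares refit are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def inliers(X, y, beta, eps):
    """Boolean mask of items whose residual is within the error tolerance eps."""
    return np.abs(y - X @ beta) <= eps

def largest_subset_search(X, y, eps, n_trials=200, subsample=None):
    """Heuristic search for a large subset fit by a linear model to precision eps.

    Repeatedly fits ordinary least squares to a random subsample and keeps the
    model with the most inliers: a RANSAC-style stand-in for the NP-hard problem.
    """
    n, d = X.shape
    subsample = subsample or d + 1
    best_beta, best_count = None, -1
    for _ in range(n_trials):
        idx = rng.choice(n, size=subsample, replace=False)
        beta, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
        count = int(inliers(X, y, beta, eps).sum())
        if count > best_count:
            best_beta, best_count = beta, count
    return best_beta, best_count

# Toy data: 80% of items follow a sparse linear model, 20% are outliers.
n, d = 500, 5
X = rng.normal(size=(n, d))
beta_true = np.array([1.5, 0.0, -2.0, 0.0, 0.5])   # sparse ground truth
y = X @ beta_true + rng.normal(scale=0.1, size=n)
y[:100] += rng.normal(scale=10.0, size=100)        # inject outliers

beta, count = largest_subset_search(X, y, eps=0.3)
print(f"model fits {count}/{n} items to within the tolerance")
```

A sparse model would additionally require an L1 (Lasso-style) penalty in the refit step; that, and the efficient approximation algorithm the paper proposes, are beyond this baseline sketch.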