Nonparametric semi-supervised classification with application to signal detection in high energy physics

Journal of the Italian Statistical Society - Volume 31 - Pages 531-550 - 2021
Alessandro Casa (1), Giovanna Menardi (2)
(1) School of Mathematics and Statistics, University College Dublin, Belfield, Ireland
(2) Department of Statistical Sciences, University of Padova, Padova, Italy

Abstract

Model-independent searches in particle physics aim to complete our knowledge of the universe by looking for new particles not predicted by current theories. Such particles, referred to as signal, are expected to appear as a deviation from the background, which represents known physics. Information available on the background can be incorporated into the search in order to identify potential anomalies. From a statistical perspective, the problem is recast as a peculiar classification problem in which only partial information is accessible. A semi-supervised approach is therefore adopted, either by strengthening the assumptions underlying clustering methods or by relaxing those underlying classification methods. In this work, following the first route, we semi-supervise nonparametric clustering in order to identify a possible signal. The main contribution consists in tuning a nonparametric estimate of the density underlying the experimental data to identify a partition that guarantees a signal warning while allowing for an accurate classification of the background. As a side contribution, a variable selection procedure is presented. The whole procedure is tested on a dataset mimicking proton–proton collisions performed within a particle accelerator. While motivated by particle physics, the approach is applicable to various scientific domains where similar anomaly detection problems arise.
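The idea of tuning a density estimate so that the background is classified accurately while a signal still raises a warning can be illustrated with a deliberately simplified sketch. It is not the authors' procedure: it assumes one-dimensional data, simulated Gaussian samples, a background known to be unimodal, and a rule (pick the smallest bandwidth that makes the background estimate unimodal) chosen here only for illustration. All function names are hypothetical.

```python
import numpy as np


def kde_1d(sample, grid, bandwidth):
    """Gaussian kernel density estimate of `sample`, evaluated on `grid`."""
    z = (grid[:, None] - sample[None, :]) / bandwidth
    kernel = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    return kernel.sum(axis=1) / (sample.size * bandwidth)


def count_modes(density):
    """Number of strict interior local maxima of a density on a grid."""
    mid = density[1:-1]
    return int(np.sum((mid > density[:-2]) & (mid > density[2:])))


def smallest_unimodal_bandwidth(sample, grid, candidates):
    """Smallest candidate bandwidth whose estimate has at most one mode.

    Hypothetical helper: encodes the assumption that the background is unimodal.
    """
    for h in sorted(candidates):
        if count_modes(kde_1d(sample, grid, h)) <= 1:
            return h
    return None


# Simulated data (assumption): a labelled background sample, and an
# unlabelled experimental sample containing background plus a small signal.
rng = np.random.default_rng(0)
background = rng.normal(0.0, 1.0, size=2000)
experimental = np.concatenate(
    [rng.normal(0.0, 1.0, size=1800), rng.normal(5.0, 0.3, size=200)]
)

grid = np.linspace(-6.0, 8.0, 400)
candidates = np.linspace(0.05, 1.0, 20)

# Tune the bandwidth on the background, then reuse it on the experimental data.
h = smallest_unimodal_bandwidth(background, grid, candidates)
modes_background = count_modes(kde_1d(background, grid, h))
modes_experimental = count_modes(kde_1d(experimental, grid, h))

# Extra modes in the experimental density are read as a signal warning.
signal_warning = modes_experimental > modes_background
print(h, modes_background, modes_experimental, signal_warning)
```

In this toy setting each mode of the estimated density corresponds to one cluster, so a mode that appears in the experimental data but not in the background plays the role of the candidate signal group. A multivariate analysis would need a bandwidth matrix and a mode-seeking algorithm rather than a grid search.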
