Nonparametric semi-supervised classification with application to signal detection in high energy physics
Abstract
Model-independent searches in particle physics aim to complete our knowledge of the universe by looking for possible new particles not predicted by current theories. Such particles, referred to as the signal, are expected to appear as a deviation from the background, which represents the known physics. Information available on the background can be incorporated into the search in order to identify potential anomalies. From a statistical perspective, the problem is recast as a peculiar classification problem in which only partial information is accessible. A semi-supervised approach is therefore called for, obtained either by strengthening the assumptions underlying clustering methods or by relaxing those underlying classification methods. In this work, following the first route, we semi-supervise nonparametric clustering in order to identify a possible signal. The main contribution consists in tuning a nonparametric estimate of the density underlying the experimental data so as to identify a partition that raises a signal warning while still allowing for an accurate classification of the background. As a side contribution, a variable selection procedure is presented. The whole procedure is tested on a dataset mimicking proton–proton collisions performed within a particle accelerator. While motivated by particle physics, the approach is applicable to various scientific domains where similar anomaly detection problems arise.
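The key mechanism described above, tuning the smoothness of a nonparametric density estimate so that a small signal component emerges as an extra mode on top of the background, can be sketched in a deliberately minimal one-dimensional form. The snippet below is an illustrative assumption, not the authors' multivariate implementation: the `mean_shift_modes` helper, the Gaussian data, and the two bandwidth values are all hypothetical choices, used only to show how an oversmoothed estimate hides the signal bump while a smaller bandwidth reveals it as a separate modal cluster.

```python
import numpy as np

def mean_shift_modes(data, h, tol=1e-6, max_iter=500):
    """Assign each point to a mode of the Gaussian kernel density
    estimate with bandwidth h, via the mean-shift iteration
    (in the spirit of Fukunaga and Hostetler, 1975)."""
    modes = []
    for x0 in data:
        x = x0
        for _ in range(max_iter):
            w = np.exp(-0.5 * ((data - x) / h) ** 2)  # Gaussian kernel weights
            x_new = np.sum(w * data) / np.sum(w)      # mean-shift update
            if abs(x_new - x) < tol:
                break
            x = x_new
        modes.append(x)
    # merge converged points whose modes fall in the same h-wide bin
    bins = np.round(np.array(modes) / h).astype(int)
    _, labels = np.unique(bins, return_inverse=True)
    return labels

rng = np.random.default_rng(0)
# a "background" bulk plus a small, displaced "signal" bump
data = np.concatenate([rng.normal(0.0, 1.0, 200),
                       rng.normal(4.0, 0.3, 20)])

n_large = len(np.unique(mean_shift_modes(data, h=2.0)))  # oversmoothed
n_small = len(np.unique(mean_shift_modes(data, h=0.5)))  # signal mode emerges
```

Under this toy setup, the large bandwidth yields a single modal cluster (no anomaly flagged), whereas the smaller one separates the signal bump from the background, which is the kind of bandwidth-driven trade-off the proposed tuning procedure formalizes.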