Anomaly detection in the probability simplex under different geometries
Tóm tắt
An open problem in data science is that of anomaly detection. Anomalies are instances that do not maintain a certain property that is present in the remaining observations in a dataset. Several anomaly detection algorithms exist, since the process itself is ill-posed mainly because the criteria that separates common or expected vectors from anomalies are not unique. In the most extreme case, data is not labelled and the algorithm has to identify the vectors that are anomalous, or assign a degree of anomaly to each vector. The majority of anomaly detection algorithms do not make any assumptions about the properties of the feature space in which observations are embedded, which may affect the results when those spaces present certain properties. For instance, compositional data such as normalized histograms, that can be embedded in a probability simplex, constitute a particularly relevant case. In this contribution, we address the problem of detecting anomalies in the probability simplex, relying on concepts from Information Geometry, mainly by focusing our efforts in the distance functions commonly applied in that context. We report the results of a series of experiments and conclude that when a specific distance-based anomaly detection algorithm relies on Information Geometry-related distance functions instead of the Euclidean distance, the performance is significantly improved.
Tài liệu tham khảo
Desai, J., Watson, D., Wang, V., Tadeo, M., Floridi, L.: The epistemological foundations of data science: a critical review. Synthese 200, 469 (2022). https://doi.org/10.1007/s11229-022-03933-2
Carmichael, I., Marron, J.S.: Data science vs. statistics: two cultures? Jpn. J. Stat. Data Sci. 1, 117–138 (2018). https://doi.org/10.1007/s42081-018-0009-3
Daoud, A., Dubhashi, D.: Statistical, modeling: the three cultures. Harvard Data Sci. Rev. (2023). https://doi.org/10.1162/99608f92.89f6fe66
Liberti, L.: Distance geometry and data science. TOP 28(2), 271–339 (2020). https://doi.org/10.1007/s11750-020-00563-0
Tukey, J.: Exploratory Data Analysis. Pearson, London (1977)
Steinbach, M., Ertöz, L., Kumar, V.: The challenges of clustering high-dimensional data. In: New Vistas in Statistical Physics: Applications in Econophysics, Bioinformatics, and Pattern Recognition
Epstein, C., Carlsson, G., Edelsbrunner, H.: Topological data analysis. Inverse Probl. 27(12), 120201 (2011). https://doi.org/10.1088/0266-5611/27/12/120201
Goldstein, M., Uchida, S.: A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data. PLoS ONE 11(4), e0152173 (2016). https://doi.org/10.1371/journal.pone.0152173
Tenenbaum, J.B., Silva, V., Langford, C.: A global geometric framework for nonlinear dimensionality reduction. Science 290(5500), 2319–2323 (2000). https://doi.org/10.1126/science.290.5500.2319
Lee, J., Verleysen, M.: Nonlinear Dimensionality Reduction. Springer, New York (2007)
Aguayo, L., Barreto, G.: Novelty detection in time series using self-organizing neural networks: a comprehensive evaluation. Neural Proc. Lett. 1, 1 (2017). https://doi.org/10.1007/s11063-017-9679
Zimek, A., Schubert, E., Kriegel, P.: A survey on unsupervised outlier detection in high-dimensional numerical data. Stat. Anal. Data Min. (2012)
Grubbs, F.E.: Sample criteria for testing outlying observations. Ann. Math. Stat. 21(1), 27–58 (1950). https://doi.org/10.1214/aoms/1177729885
Barnett, V., Lewis, T.: Outliers in Statistical Data. Wiley, New York (1978)
Markou, M., Singh, M.: Novelty detection: a review-Part 1, statistical approaches. Signal Process. 83(12), 2481–2497 (2003). https://doi.org/10.1016/j.sigpro.2003.07.0
Ester, M., Kriegel, H.P., Sander, J., Xu, X., Xiaowei, E.S., Evangelos, H., Jiawei, F., Usama M. (eds.).: A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), pp. 226–231. AAAI Press, Washington (1996)
Brendan, J.F., Delbert, D.: Clustering by passing messages between data points. Science 315(5814), 972–976 (2007). https://doi.org/10.1126/science.1136800
Breunig, M., Kriegel, H.P., Ng, R., Sander, J., LOF: Identifying density-based local outliers. In: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pp. 93–104. SIGMOD. https://doi.org/10.1145/335191.335388. ISBN 1-58113-217-4 (2000)
Pimentel, M., Clifton, D., Clifton, L., Tarassenko, L.: A review on novelty detection. Signal Process. 99, 215–249 (2014)
Markou, M., Singh, M.: Novelty detection: a review-Part 2, neural network based approaches. Signal Process. 83(12), 2499–2521 (2003). https://doi.org/10.1016/j.sigpro.2003.07.019
Selicato, L., Esposito, F., Gargano, G., Vegliante, M.C., Opinto, G., Zaccaria, G.M., Ciavarella, S., Guarini, A., Del Buono, N.: A new ensemble method for detecting anomalies in gene expression matrices. Mathematic 9, 882 (2021). https://doi.org/10.3390/math9080882
Li, H.Z., Boulanger, P.: A survey of heart anomaly detection using ambulatory electrocardiogram (ECG). Sensors (Basel) 20(5), 1461 (2020). https://doi.org/10.3390/s20051461
Basora, L., Olive, X., Dubot, T.: Recent advances in anomaly detection methods applied to aviation. Aerospace 6(11), 117 (2019). https://doi.org/10.3390/aerospace6110117
Schwabacher, M., Oza, N., Matthews, B.: Unsupervised anomaly detection for liquid-fueled rocket propulsion health monitoring. J. Aerosp. Comput. Inf. Commun. 6, 7 (2009)
Yepmo, G., Smits, G., Pivert, O.: Anomaly explanation: a review. Data Knowl. Eng. 137, 101946 (2022)
Greenacre, M.: Compositional Data Analysis in Practice. CRC Press, London (2018)
Aitchison, J.: The statistical analysis of compositional data. J. R. Stat. Soc. B 44(2), 139–177 (1982)
Nielsen, F.: An elementary introduction to information geometry. Entropy 22(10), 1100 (2020). https://doi.org/10.3390/e22101100
Nielsen, F.: The many faces of information geometry. Notices AMS 69, 36–45 (2022)
Rao, C.R.: Information and accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math. Soc. 37, 81–91 (1945)
Deza, M., Deza, E.: Encyclopedia of Distances. Springer, New York (2018)
Aitchison, J.: Principal component analysis of compositional data. Biometrika 70(1), 57–65 (1983)
Nielsen, F., Sun, K.: Clustering in Hilbert simplex geometry. Clustering in Hilbert’s projective geometry: the case studies of the probability simplex and the elliptope of correlation matrices. In: Nielsen, F. (eds) Geometric Structures of Information. Signals and Communication Technology. Springer, Cham. https://doi.org/10.1007/978-3-030-02520-5_11 (2019)
Avalos-Fernandez, M., Nock, R., Ong, C.S., Rouar, J., Sun, K.: Representation learning of compositional data. NIPS 18, 6680–6690 (2018). https://doi.org/10.5555/3327757.3327774
Bulmer, M.: Principles of Statistics. Dover Publications, New York (1979)
Li, Q., McKenzie, D., Yin, W.: From the simplex to the sphere: faster constrained optimization using the Hadamard parametrization. arXiv:2112.05273. https://doi.org/10.48550/arXiv.2112.05273 (2022)
Mehrotra, K., Mihan, C., Huang, H.: Anomaly Detection, Principles and Algorithms. Springer, New York (2017)
Schubert, E., Zimek, A., Kriegel, H.P.: Local outlier detection reconsidered: a generalized view on locality with applications to spatial, video, and network outlier detection. Data Min. Knowl. Discov. 28, 190–237 (2014). https://doi.org/10.1007/s10618-012-0300-z
Liu, F.T., Ting, K.M., ZHou, Z.H.: Isolation forest. In: Eighth IEEE International Conference on Data Mining, pp. 413–422. https://doi.org/10.1109/ICDM.2008.17. ISBN 978-0-7695-3502-9. S2CID 6505449 (2008)
Knorr, E., Ng, R., Tucakov, V.: Distance-based outliers: algorithms and applications. VLDB J. 8, 237–253 (2000). https://doi.org/10.1007/s007780050006
Iglewicz, B., Hoaglin, D.: How to Detect and Handle Outliers. American Society for Quality Control, New York (1993)
Aguayo, L., Barreto, G.: Novelty detection in time series using self-organizing neural networks: a comprehensive evaluation. Neural Process. Lett. 47, 1 (2017). https://doi.org/10.1007/s11063-017-9679
Neme, A., Lugo, B., Cervera, A.: Authorship attribution as a case of anomaly detection: a neural network model. Int. J. Hybrid Intell. Syst. 8(4), 225–235 (2011)
Neme, A., Gutierrez-Pulido, J., Muñoz, A., Hernández, S., Dey, T.: Stylistics analysis and authorship attribution algorithms based on self-organizing maps. Neurocomputing 147, 147–159 (2015)
Forrest, S., Perelson, A.S., Allen, L., Cherukuri, R.: Self-nonself discrimination in a computer. In: Proceedings of the 1994 IEEE Symposium on Research in Security and Privacy, Los Alamitos, pp. 202–212 (1994)
Wang, K., Langevin, S., Shattuck, M., Ogle, S., Kirby, M.: Anomaly detection in host signaling pathways for the early prognosis of acute infection. PLOS (2016). https://doi.org/10.1371/journal.pone.0160919
Wang, G., Yang, J., Li, R.: Imbalanced SVM-based anomaly detection algorithm for imbalanced training datasets. Electron. Telecommun. Res. Inst. 39–5, 621–631 (2017). https://doi.org/10.4218/etrij.17.0116.0879
Zhao, W., Li, L., Alam, S., Wang, Y.: An incremental clustering method for anomaly detection in flight data. Transport. Res. Part C Emerg. Technol. 132, 103406 (2021). https://doi.org/10.1016/j.trc.2021.103406
Evangelou, M., Adams, N.: An anomaly detection framework for cyber-security data. Comput. Secur. 97, 101941 (2021). https://doi.org/10.1016/j.cose.2020.101941
Novikova, E., Kotenko, I.: Visual analytics for detecting anomalous activity in mobile money transfer services. In: International Cross-Domain Conference and Workshop on Availability, Reliability,and Security (CD-ARES), Fribourg pp. 63–78. https://doi.org/10.1007/978-3-319-10975-65 (2014)
Garrard, P., Maloney, L., Hodges, J., Patterson, K.: The effects of very early Alzheimer’s disease on the characteristics of writing by a renowned author. Brain 128(2), 250–260 (2005). https://doi.org/10.1093/brain/awh341
Close, L., Kashef, R.: Combining artificial immune system and clustering analysis: a stock market anomaly detection model. J. Intell. Learn. Syst. Appl. (2020). https://doi.org/10.4236/jilsa.2020.124005
Colignatus, T.: Comparing the Aitchison Distance and the Angular Distance for Use as Inequality or Disproportionality Measures for Votes and Seats (2018)
Villani, C.: Optimal Transport, Old and New. Springer, New York. ISBN 978-3-540-71050-9 (2008)
Bigot, J.: Statistical data analysis in the Wasserstein space. J. 2018 MAS Sampling Process. 68, 1–19 (2020). https://doi.org/10.1051/proc/202068001
Peyre, G., Cuturi, M.: Computational Optimal Transport. arXiv:1803.00567 (2018)
Aler, R., Valss, J., Bostrom, H.: Study of Hellinger distance as a splitting metric for random forests in balanced and imbalanced classification datasets. Expert Syst. Appl. 1, 113264 (2020). https://doi.org/10.1016/j.eswa.2020.113264
Lavigne, C., Ricci, B., Franck, P., Senoussi, R.: Spatial analyses of ecological count data: a density map comparison approach. Basic Appl. Ecol. 11, 734–742 (2010)
Menendez, M.L., Pardo, J.A., Pardo, M.: The Jensen–Shannon divergence. J. Franklin Inst. 334(2), 307–318 (1997). https://doi.org/10.1016/S0016-0032(96)00063-4
Coles, P., Cerezo, M., Cincio, L.: Strong bound between trace distance and Hilbert-Schmidt distance for low-rank states. Phys. Rev. A. 100(2), 022103 (2019). https://doi.org/10.1103/PhysRevA.100.022103
Gattone, S., Sanctis, A., Russo, T., Pulcini, D.: A shape distance based on the Fisher-Rao metric and its application for shapes clustering. Phys. A Stat. Mech. Appl. (2017). https://doi.org/10.1016/j.physa.2017.06.014
Hawkins, D.: Identification of Outliers. Springer, New York (1980)
Nakamura, Y., Gojobori, T., Ikemura, T.: Codon usage tabulated from the international DNA sequence databases: status for the year 2000. Nucl. Acids Res. 28, 292 (2000)
Khomtchouk, B.B.: Codon usage bias levels predict taxonomic identity and genetic composition. bioRxiv (2020). https://doi.org/10.1101/2020.10.26.356295
Nelson, D.L., Cox, M.M.: Principles of Biochemistry, 4th edn. W. H. Freeman, New York. ISBN 0-7167-4339-6 (2005)
Parvathy, S.T., Udayasuriyan, V., Bhadana, V.: Codon usage bias. Mol. Biol. Rep. 49, 539–565 (2022). https://doi.org/10.1007/s11033-021-06749-4
Prat, Y., Fromer, M., Linial, N.: Codon usage is associated with the evolutionary age of genes in metazoan genomes. BMC Evol. Biol. 9, 285 (2009). https://doi.org/10.1186/1471-2148-9-285
Pearson, K.: A First Study of the Statistics of Pulmonary Tuberculosis. Dalau, London (1907)
Poincare, H.: Analysis Situs. Translated version from French (1895)
Lloyd, S.P.: Least squares quantization in PCM. IEEE Trans. Inf. Theory 28(2), 129–137 (1982)
Shannon, C.E.A.: Mathematical theory of communication. Bell Syst. Tech. J. 27(3), 379–423, 623–656 (2020) https://doi.org/10.1002/j.1538-7305.1948.tb01338.x