Anomaly detection in the probability simplex under different geometries

Information Geometry - Volume 6 - Pages 385-412 - 2023

Uriel Legaria¹, Sergio Mota¹, Sergio Martinez¹, Alfredo Cobá², Argenis Chable², Antonio Neme³

¹Posgrado en Ciencia e Ingeniería de la Computación, Universidad Nacional Autónoma de México, Mexico City, Mexico

²Facultad de Matemáticas, Universidad Autónoma de Yucatán, Mérida, Mexico

³Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas (IIMAS), Unidad Académica en el Estado de Yucatán, Universidad Nacional Autónoma de México, Mérida, Mexico

Abstract

Anomaly detection remains an open problem in data science. Anomalies are instances that lack a property shared by the remaining observations in a dataset. Many anomaly detection algorithms exist because the problem itself is ill-posed: the criteria that separate common or expected vectors from anomalies are not unique. In the most extreme case, the data are unlabelled and the algorithm must either identify the anomalous vectors or assign a degree of anomaly to each vector. Most anomaly detection algorithms make no assumptions about the properties of the feature space in which the observations are embedded, which may affect the results when that space has particular structure. Compositional data such as normalized histograms, which can be embedded in a probability simplex, constitute a particularly relevant case. In this contribution, we address the problem of detecting anomalies in the probability simplex, relying on concepts from Information Geometry and focusing on the distance functions commonly applied in that context. We report the results of a series of experiments and conclude that when a specific distance-based anomaly detection algorithm relies on Information Geometry-related distance functions instead of the Euclidean distance, its performance improves significantly.
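To make the idea of swapping the distance function concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not name the detection algorithm or the exact distances used, so the sketch assumes a simple mean k-nearest-neighbour distance score and two distances that are standard on the probability simplex (Fisher-Rao and Hellinger), alongside the Euclidean baseline. The function names, the toy data, and the choice k=3 are all illustrative assumptions.

```python
import numpy as np

def euclidean(p, q):
    """Ordinary Euclidean distance, used here as the baseline."""
    return float(np.linalg.norm(p - q))

def hellinger(p, q):
    """Hellinger distance between two discrete distributions (range [0, 1])."""
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

def fisher_rao(p, q):
    """Fisher-Rao geodesic distance on the simplex (range [0, pi])."""
    bc = np.sum(np.sqrt(p * q))  # Bhattacharyya coefficient
    # Clip guards against rounding pushing bc slightly outside [0, 1].
    return float(2.0 * np.arccos(np.clip(bc, 0.0, 1.0)))

def knn_anomaly_scores(X, dist, k=3):
    """Assumed scoring rule: mean distance to the k nearest neighbours.

    Larger scores mean the point is farther from its neighbours,
    i.e. more anomalous under the chosen distance.
    """
    n = len(X)
    D = np.array([[dist(X[i], X[j]) for j in range(n)] for i in range(n)])
    # After sorting each row, column 0 is the zero self-distance; skip it.
    return np.sort(D, axis=1)[:, 1:k + 1].mean(axis=1)

# Hypothetical toy data: five similar compositions and one outlier (last row).
X = np.array([
    [0.60, 0.30, 0.10],
    [0.58, 0.31, 0.11],
    [0.62, 0.29, 0.09],
    [0.59, 0.32, 0.09],
    [0.61, 0.28, 0.11],
    [0.05, 0.05, 0.90],
])

scores = {d.__name__: knn_anomaly_scores(X, d) for d in (euclidean, hellinger, fisher_rao)}
for name, s in scores.items():
    print(name, "most anomalous index:", int(np.argmax(s)))
```

On this toy data every distance flags the same point, so the example only shows how the distance is plugged into the detector; whether the information-geometric distances rank anomalies better than the Euclidean one is an empirical question that the experiments reported in the paper address.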

