A New Dimensionality-Unbiased Score for Efficient and Effective Outlying Aspect Mining

Data Science and Engineering - Tập 7 Số 2 - Trang 120-135 - 2022
Durgesh Samariya1, Jiangang Ma1
1School of Engineering, Information Technology and Physical Sciences, Federation University, Berwick, VIC, Australia

Tóm tắt

AbstractThe main aim of the outlying aspect mining algorithm is to automatically detect the subspace(s) (a.k.a. aspect(s)), where a given data point is dramatically different than the rest of the data in each of those subspace(s) (aspect(s)). To rank the subspaces for a given data point, a scoring measure is required to compute the outlying degree of the given data in each subspace. In this paper, we introduce a new measure to compute outlying degree, called Simple Isolation score using Nearest Neighbor Ensemble (SiNNE), which not only detects the outliers but also provides an explanation on why the selected point is an outlier. SiNNE is a dimensionally unbias measure in its raw form, which means the scores produced by SiNNE are compared directly with subspaces having different dimensions. Thus, it does not require any normalization to make the score unbiased. Our experimental results on synthetic and publicly available real-world datasets revealed that (i) SiNNE produces better or at least the same results as existing scores. (ii) It improves the run time of the existing outlying aspect mining algorithm based on beam search by at least two orders of magnitude. SiNNE allows the existing outlying aspect mining algorithm to run in datasets with hundreds of thousands of instances and thousands of dimensions which was not possible before.

Từ khóa


Tài liệu tham khảo

Angiulli F, Fassetti F, Manco G, Palopoli L (2017) Outlying property detection with numerical attributes. Data Min Knowl Disc 31(1):134–163

Bandaragoda TR, Ting KM, Albrecht D, Liu FT, Wells JR (2014) Efficient anomaly detection by isolation using nearest neighbour ensemble. In: 2014 IEEE international conference on data mining workshop, pp 698–705

Bandaragoda TR, Ting KM, Albrecht D, Liu FT, Zhu Y, Wells JR (2018) Isolation-based anomaly detection using nearest-neighbor ensembles. Comput Intell 34(4):968–998. https://doi.org/10.1111/coin.12156

Brockett PL, Xia X, Derrig RA (1998) Using Kohonen’s self-organizing feature map to uncover automobile bodily injury claims fraud. J Risk Insur 65(2):245–274. http://www.jstor.org/stable/253535

Campos GO, Zimek A, Sander J, Campello RJGB, Micenková B, Schubert E, Assent I, Houle ME (2016) On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study. Data Min Knowl Disc 30(4):891–927. https://doi.org/10.1007/s10618-015-0444-8

Chan PK, Fan W, Prodromidis AL, Stolfo SJ (1999) Distributed data mining in credit card fraud detection. IEEE Intell Syst Appl 14(6):67–74

Dang XH, Micenková B, Assent I, Ng RT (2013) Local outlier detection with interpretation. In: Blockeel H, Kersting K, Nijssen S, Železný F (eds) Machine learning and knowledge discovery in databases. Springer Berlin Heidelberg, Berlin, pp 304–320

Duan L, Tang G, Pei J, Bailey J, Campbell A, Tang C (2015) Mining outlying aspects on numeric data. Data Min Knowl Disc 29(5):1116–1151. https://doi.org/10.1007/s10618-014-0398-2

Gupta N, Eswaran D, Shah N, Akoglu L, Faloutsos C (2019) Beyond outlier detection: lookout for pictorial explanation. In: Berlingerio M, Bonchi F, Gärtner T, Hurley N, Ifrim G (eds) Machine learning and knowledge discovery in databases. Springer, Cham, pp 122–138

Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH (2009) The weka data mining software: an update. SIGKDD Explor Newsl 11(1):10–18. https://doi.org/10.1145/1656274.1656278

Härdle W (2012) Smoothing techniques: with implementation in S. Springer, New York

Harris CR, Millman KJ, van der Walt SJ, Gommers R, Virtanen P, Cournapeau D, Wieser E, Taylor J, Berg S, Smith NJ, Kern R, Picus M, Hoyer S, van Kerkwijk MH, Brett M, Haldane A, del Río JF, Wiebe M, Peterson P, Gérard-Marchant P, Sheppard K, Reddy T, Weckesser W, Abbasi H, Gohlke C, Oliphant TE (2020) Array programming with NumPy. Nature 585(7825):357–362. https://doi.org/10.1038/s41586-020-2649-2

Keller F, Muller E, Bohm K (2012) Hics: high contrast subspaces for density-based outlier ranking. In: Proceedings of the 2012 IEEE 28th International Conference on Data Engineering, IEEE Computer Society, Washington, DC, USA, ICDE’12, pp 1037–1048, https://doi.org/10.1109/ICDE.2012.88

Lin J, Keogh E, Ada Fu, Van Herle H (2005) Approximations to magic: finding unusual medical time series. In: 18th IEEE symposium on computer-based medical systems (CBMS’05), pp 329–334

Liu FT, Ting KM, Zhou Z (2008) Isolation forest. In: 2008 Eighth IEEE international conference on data mining, pp 413–422

Liu N, Shin D, Hu X (2018) Contextual outlier interpretation. In: Proceedings of the 27th international joint conference on artificial intelligence. AAAI Press, IJCAI’18, pp 2461–2467

Mejía-Lavalle M, Sánchez Vivar A (2009) Outlier detection with explanation facility. In: Perner P (ed) Machine learning and data mining in pattern recognition. Springer Berlin Heidelberg, Berlin, pp 454–464

Micenková B, Ng RT, Dang X, Assent I (2013) Explaining outliers by subspace separability. In: 2013 IEEE 13th international conference on data mining, pp 518–527, https://doi.org/10.1109/ICDM.2013.132

Muandet K, Fukumizu K, Sriperumbudur B, Schölkopf B (2017) Kernel mean embedding of distributions: a review and beyond. Found Trends Mach Learn 10(1–2):1–141

Samariya D, Ma J (2021) Mining outlying aspects on healthcare data. In: Siuly S, Wang H, Chen L, Guo Y, Xing C (eds) Health information science. Springer, Cham, pp 160–170

Samariya D, Thakkar A (2021) A comprehensive survey of anomaly detection algorithms. Ann Data Sci. https://doi.org/10.1007/s40745-021-00362-9

Samariya D, Aryal S, Ting KM, Ma J (2020) A new effective and efficient measure for outlying aspect mining. In: Huang Z, Beek W, Wang H, Zhou R, Zhang Y (eds) Web information systems engineering—WISE 2020. Springer, Cham, pp 463–474

Samariya D, Ma J, Aryal S (2020b) A comprehensive survey on outlying aspect mining methods. arXiv preprint arXiv:2005.02637

Silverman BW (1986) Density estimation for statistics and data analysis. Chapman & Hall, London

Tange O (2020) Gnu parallel 20201022 (‘samuelpaty’). Zenodo. https://doi.org/10.5281/zenodo.4118697

Vinh NX, Chan J, Bailey J, Leckie C, Ramamohanarao K, Pei J (2015) Scalable outlying-inlying aspects discovery via feature ranking. In: Cao T, Lim EP, Zhou ZH, Ho TB, Cheung D, Motoda H (eds) Advances in knowledge discovery and data mining. Springer, Cham, pp 422–434

Vinh NX, Chan J, Romano S, Bailey J, Leckie C, Ramamohanarao K, Pei J (2016) Discovering outlying aspects in large datasets. Data Min Knowl Disc 30(6):1520–1555. https://doi.org/10.1007/s10618-016-0453-2

Wells JR, Ting KM (2019) A new simple and efficient density estimator that enables fast systematic search. Pattern Recognit Lett 122:92–98. https://doi.org/10.1016/j.patrec.2018.12.020

Xu H, Wang Y, Jian S, Huang Z, Wang Y, Liu N, Li F (2021) Beyond outlier detection: Outlier interpretation by attention-guided triplet deviation network. In: Proceedings of the web conference 2021, association for computing machinery, New York, NY, USA, WWW’21, pp 1328–1339, https://doi.org/10.1145/3442381.3449868

Xu YX, Pang M, Feng J, Ting KM, Jiang Y, Zhou ZH (2021) Reconstruction-based anomaly detection with completely random forest. In: Proceedings of the 2021 SIAM international conference on data mining (SDM), SIAM, pp 127–135

Zhang J, Lou M, Ling TW, Wang H (2004) Hos-miner: a system for detecting outlyting subspaces of high-dimensional data. In: Proceedings of the thirtieth international conference on very large data bases—volume 30, VLDB endowment, Toronto, Canada, VLDB’04, pp 1265–1268, http://dl.acm.org/citation.cfm?id=1316689.1316810

Zhang J, Marszałek M, Lazebnik S, Schmid C (2007) Local features and kernels for classification of texture and object categories: a comprehensive study. Int J Comput Vis 73(2):213–238. https://doi.org/10.1007/s11263-006-9794-4