Clustering refinement

International Journal of Data Science and Analytics - Tập 12 - Trang 333-353 - 2021
Félix Iglesias1, Tanja Zseby1, Arthur Zimek2
1Institute of Telecommunications, TU Wien, Vienna, Austria
2Department of Mathematics and Computer Science (IMADA), University of Southern Denmark (SDU), Odense M, Denmark

Tóm tắt

Advanced validation of cluster analysis is expected to increase confidence and allow reliable implementations. In this work, we describe and test CluReAL, an algorithm for refining clustering irrespective of the method used in the first place. Moreover, we present ideograms that enable summarizing and properly interpreting problem spaces that have been clustered. The presented techniques are built on absolute cluster validity indices. Experiments cover a wide variety of scenarios and six of the most popular clustering techniques. Results show the potential of CluReAL for enhancing clustering and the suitability of ideograms to understand the context of the data through the lens of the cluster analysis. Refinement and interpretability are both crucial to reduce failure and increase performance control and operational awareness in unsupervised analysis.

Tài liệu tham khảo

Ankerst, M., Breunig, M., Kriegel, H.P., Sander, J.: Optics: Ordering points to identify the clustering structure. SIGMOD Rec. 28(2), 49–60 (1999) Arbelaitz, O., Gurrutxaga, I., Muguerza, J., Pérez, J.M., Perona, I.: An extensive comparative study of cluster validity indices. Pattern Recogn. 46(1), 243–256 (2013) Arthur, D., Vassilvitskii, S.: How slow is the k-means method? In: Proceedings of the Twenty-Second Annual Symposium on Computational Geometry, Association for Computing Machinery, New York, NY, USA, SCG ’06, pp 144–153, 10.1145/1137856.1137880 (2006) Basak, J., Krishnapuram, R.: Interpretable hierarchical clustering by constructing an unsupervised decision tree. IEEE Trans. Knowl. Data Eng. 17(1), 121–132 (2005) Biernacki, C., Celeux, G., Govaert, G.: Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Trans. Pattern Anal. Mach. Intell. 22(7), 719–725 (2000) Bouchachia, A., Pedrycz, W.: Enhancement of fuzzy clustering by mechanisms of partial supervision. Fuzzy Sets Syst. 157(13), 1733–1759 (2006) Caliński, T., Harabasz, J.: A dendrite method for cluster analysis. Commun. Stat. 3(1), 1–27 (1974) Campello, R.J., Moulavi, D., Zimek, A., Sander, J.: A framework for semi-supervised and unsupervised optimal extraction of clusters from hierarchies. Data Mining Knowl. Discovery 27(3), 344–371 (2013) Campello, R.J.G.B., Moulavi, D., Zimek, A., Sander, J.: Hierarchical density estimates for data clustering, visualization and outlier detection. TKDD 10(1), 1–51 (2015) Campello, R.J.G.B., Kröger, P., Sander, J., Zimek, A.: Density-based clustering. Wiley Interdiscip Rev Data Min Knowl Discov 10(2), 10.1002/widm.1343 (2020) Davies, D.L., Bouldin, D.W.: A cluster separation measure. IEEE Trans. Pattern Anal. Mach. Intell. PAMI 1(2), 224–227 (1979) Day, W.H.E., Edelsbrunner, H.: Efficient algorithms for agglomerative hierarchical clustering methods. J. Classif. 1(1), 7–24 (1984) Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the em algorithm. J. Royal Stat. Soc.: Ser. B (Methodol.) 39(1), 1–22 (1977) Demšar, J.: Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 7, 1–30 (2006) Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, AAAI Press, KDD’96, p 226–231 (1996) Fränti, P., Virmajoki, O.: Iterative shrinking method for clustering problems. Pattern Recogn. 39(5), 761–765 (2006). https://doi.org/10.1016/j.patcog.2005.09.012 Fränti, P., Virmajoki, O., Hautamäki, V.: Fast agglomerative clustering using a k-nearest neighbor graph. IEEE Trans. Pattern Anal. Mach. Intell. 28(11), 1875–1881 (2006) Heine, C., Scheuermann, G.: Manual clustering refinement using interaction with blobs. In: Proceedings of the 9th Joint Eurographics / IEEE VGTC Conference on Visualization, Eurographics Association, Goslar, DEU, EUROVIS-07, p 59–66 (2007) Iglesias, F., Zseby, T., Ferreira, D., Zimek, A.: Mdcgen: Multidimensional dataset generator for clustering. J. Classif. 36, 599–618 (2019) Iglesias, F., Zseby, T., Zimek, A.: Absolute cluster validity. IEEE Trans. Pattern Anal. Mach. Intell. 42(9), 2096–2112 (2020a) Iglesias, F., Zseby, T., Zimek, A.: Interpretability and refinement of clustering. In: 2020 IEEE 7th International Conference on Data Science and Advanced Analytics (DSAA), pp 21–29, 10.1109/DSAA49011.2020.00014 (2020b) Jaccard, P.: The distribution of the flora in the alpine zone 1. New Phytol. 11(2), 37–50 (1912) Kärkkäinen, I., Fränti, P.: Dynamic local search algorithm for the clustering problem. Tech. Rep. A-2002-6, Department of Computer Science, University of Joensuu, Joensuu, Finland (2002) Karypis, G.: Cluto-a clustering toolkit. Tech. rep., Minnesota Univ. Minneapolis Dept. of Computer Science (2002) Kriegel, H., Kröger, P., Zimek, A.: Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. TKDD 3(1), 1–58 (2009) Liu, Y., Li, Z., Xiong, H., Gao, X., Wu, J., Wu, S.: Understanding and enhancement of internal clustering validation measures. IEEE Trans. Cybern. 43(3), 982–994 (2013) Lloyd, S.: Least squares quantization in pcm. IEEE Trans. Inf. Theory 28(2), 129–137 (1982) van der Maaten, L., Hinton, G.: Visualizing data using t-sne. Journal of Machine Learning Research 9(86):2579–2605, http://jmlr.org/papers/v9/vandermaaten08a.html (2008) McInnes, L., Healy, J., Astels, S.: hdbscan: Hierarchical density based clustering. Journal of Open Source Software 2(11): 205 (2017) Mirkin, B.: Choosing the number of clusters. WIREs Data Mining Knowl. Discovery 1(3), 252–260 (2011) Moulavi, D., Jaskowiak, P.A., Campello R.J.G.B., Zimek, A., Sander, J.: Density-based clustering validation. In: SDM, SIAM, pp 839–847 (2014) Murtagh, F.: Counting dendrograms: A survey. Discrete Appl. Math. 7(2), 191–199 (1984) Rahmah, N., Sitanggang, I.S.: Determination of optimal epsilon (eps) value on DBSCAN algorithm to clustering data on peatland hotspots in sumatra. IOP Conference Series: Earth and Environmental Science 31, 012012 (2016). https://doi.org/10.1088/1755-1315/31/1/012012 Rand, W.: Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc. 66(336), 846–850 (1971) Raykar, V.C., Duraiswami, R., Zhao, L.H.: Fast computation of kernel estimators. J. Comput. Graphical Stat. 19(1), 205–220 (2010) Rezaei, M., Fränti, P.: Set-matching methods for external cluster validity. IEEE Trans. Knowl. Data Eng. 28(8), 2173–2186 (2016) Rousseeuw, P.J.: Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65 (1987) Sander, J., Ester, M., Kriegel, H.P., Xu, X.: Density-based clustering in spatial databases: The algorithm gdbscan and its applications. Data Mining Knowl. Discovery 2(2), 169–194 (1998) Saw, J.G., Yang, M.C.K., Mo, T.C.: Chebyshev inequality with estimated mean and variance. Am. Stat. 38(2), 130–132 (1984) Sculley, D.: Web-scale k-means clustering. In: Proceedings of the 19th International Conference on World Wide Web, ACM, New York, NY, USA, WWW ’10, pp 1177–1178 (2010) Shannon, C.E.: A mathematical theory of communication. ACM SIGMOBILE Mobile Comput. Commun. Rev. 5(1), 3–55 (2001) Silverman, B.W.: Using kernel density estimates to investigate multimodality. J. R. Stat. Soc.: Ser. B 43(1), 97–99 (1981) Silverman, B.W.: Density Estimation for Statistics and Data Analysis. Chapman & Hall, London (1986) Vendramin, L., Campello, R.J.G.B., Hruschka, E.R.: Relative clustering validity criteria: A comparative overview. Stat. Anal. Data Mining 3(4), 209–235 (2010) Vinh, N.X., Epps, J., Bailey, J.: Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. J. Mach. Learn. Res. 11, 2837–2854 (2010) Doğan, Yunus, Dalkılıç, Feriştah, Birant, Derya, Kut, Recep Alp, Yılmaz, Reyat: Novel two-dimensional visualization approaches for multivariate centroids of clustering algorithms. Sci. Program. 2018, 23 (2018) Zhang, T., Ramakrishnan, R., Livny, M.: Birch: An efficient data clustering method for very large databases. SIGMOD Rec. 25(2), 103–114 (1996). https://doi.org/10.1145/235968.233324 Ünlü, R., Xanthopoulos, P.: Estimating the number of clusters in a dataset via consensus clustering. Expert Syst. Appl. 125, 33–39 (2019)