Enhancing cluster analysis via topological manifold learning

Moritz Herrmann1,2,3, Daniyal Kazempour4, Fabian Scheipl1,2, Peer Kröger4
1Department of Statistics, Ludwig-Maximilians-Universität München, Munich, Germany
2Munich Center for Machine Learning, Munich, Germany
3Institute for Medical Information Processing, Biometry and Epidemiology, Ludwig-Maximilians-Universität München, Munich, Germany
4Department of Computer Science, Christian-Albrechts-Universität zu Kiel, Kiel, Germany

Abstract

We discuss the geometric aspects of cluster analysis and show that inferring the geometric structure of a dataset before clustering can substantially improve cluster detection: we demonstrate that clustering embedding vectors representing a dataset's latent structure, rather than the observed feature vectors, yields considerable benefits. To illustrate this, we combine the manifold learning method UMAP for inferring geometric structure with the density-based clustering method DBSCAN. Results on synthetic and real-world data show that this both simplifies and improves clustering across a broad range of low- and high-dimensional problems, including clusters of varying density and/or entangled, non-convex shapes. Our approach simplifies clustering because a consistent geometric preprocessing step reduces the sensitivity of DBSCAN's parameters. Clustering the resulting embeddings with DBSCAN can even outperform complex methods such as SPECTACL and ClusterGAN. Finally, our investigation suggests that the crucial issue in clustering is not the nominal dimensionality of the data or how many irrelevant features it contains, but rather the separability of the clusters in the ambient observation space in which they are embedded, usually the (high-dimensional) Euclidean space defined by the data's features. The approach succeeds because it performs cluster analysis after projecting the data into a space that is, in a sense, optimized for separability.
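The two-step pipeline the abstract describes can be sketched as follows: first embed the observed feature vectors into a latent space via manifold learning, then run density-based clustering on the embedding vectors rather than the raw features. The paper uses UMAP for the embedding step; in this minimal sketch scikit-learn's Isomap stands in as the manifold-learning step so the example needs only scikit-learn, and all parameter values are illustrative rather than the paper's tuned settings.

```python
# Sketch of the pipeline: manifold learning first, density-based
# clustering on the embedding second. Isomap stands in for UMAP here;
# parameter values are illustrative only.
from sklearn.datasets import make_moons
from sklearn.manifold import Isomap
from sklearn.cluster import DBSCAN

# Two entangled, non-convex clusters in the observed feature space.
X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)

# Step 1: infer the latent geometric structure of the data.
embedding = Isomap(n_neighbors=15, n_components=2).fit_transform(X)

# Step 2: cluster the embedding vectors instead of the raw features.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(embedding)

print(embedding.shape)  # (500, 2)
```

Because the embedding step normalizes the data's geometry, DBSCAN's `eps` and `min_samples` become less sensitive to the particular dataset, which is the simplification effect the abstract refers to.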

Keywords

#cluster analysis #manifold learning #DBSCAN #UMAP #geometric structure
