Incremental Density-Based Clustering on Multicore Processors

IEEE Transactions on Pattern Analysis and Machine Intelligence - Tập 44 Số 3 - Trang 1338-1356 - 2022
Son T. Mai1, Jon Jacobsen2, Sihem Amer-Yahia3, Ivor Spence1, Nhat-Phuong Tran1, Ira Assent2, Quoc Viet Hung Nguyen4
1Queen’s University Belfast, Belfast, United Kingdom
2Aarhus University, Aarhus, Denmark
3University of Grenoble Alpes, Saint-Martin-d'Hères, France
4Griffith University, Brisbane, Queensland, Australia

Tóm tắt

The density-based clustering algorithm is a fundamental data clustering technique with many real-world applications. However, when the database is frequently changed, how to effectively update clustering results rather than reclustering from scratch remains a challenging task. In this work, we introduce IncAnyDBC, a unique parallel incremental data clustering approach to deal with this problem. First, IncAnyDBC can process changes in bulks rather than batches like state-of-the-art methods for reducing update overheads. Second, it keeps an underlying cluster structure called the object node graph during the clustering process and uses it as a basis for incrementally updating clusters wrt. inserted or deleted objects in the database by propagating changes around affected nodes only. In additional, IncAnyDBC actively and iteratively examines the graph and chooses only a small set of most meaningful objects to produce exact clustering results of DBSCAN or to approximate results under arbitrary time constraints. This makes it more efficient than other existing methods. Third, by processing objects in blocks, IncAnyDBC can be efficiently parallelized on multicore CPUs, thus creating a work-efficient method. It runs much faster than existing techniques using one thread while still scaling well with multiple threads. Experiments are conducted on various large real datasets for demonstrating the performance of IncAnyDBC.

Từ khóa

#Density-based clustering #anytime clustering #incremental clustering #active clustering #multicore CPUs

Tài liệu tham khảo

Aggarwal, 2014, Data Clustering: Algorithms and Applications, 10.1201/b17320 Han, 2012, Data Mining: Concepts and Techniques Ester, Incremental clustering for mining in a data warehousing environment, Proc. 24rd Int. Conf. Very Large Data Bases, 323 10.1007/978-3-319-56608-5_50 10.1145/2505515.2507837 10.1145/3035918.3064050 10.5555/3001460.3001507 10.1145/1247480.1247546 10.1109/ICDM.2012.95 10.1016/j.pmcj.2009.07.005 10.1145/3083897 10.1145/2723372.2737792 Gunawan, 2013, A faster algorithm for DBSCAN 10.1145/2939672.2939750 Frank, 2019, UCI machine learning repository 10.1111/j.1365-2966.2006.11287.x 10.1007/s10618-018-0562-1 10.1145/3068335 10.1007/s10115-016-1004-2 10.1109/SC.2012.9 Tan, 2005, Introduction to Data Mining, (First Edition) 10.1145/2834892.2834894 10.1145/258533.258657 Ackerman, Incremental clustering: The case for extra clusters, Proc. 27th Int. Conf. Neural Inf. Process. Syst., 307 10.1007/978-3-540-24741-8_8 10.1109/ICDE.2017.94 10.1109/TKDE.2018.2828086 10.1137/1.9781611972764.29 10.1145/1552303.1552307 10.1007/978-3-319-55705-2_10 10.1007/978-3-319-44039-2_3 10.1109/ICDM.2004.10082 Böhm, Efficient anytime density-based clustering, Proc. SIAM Int. Conf. Data Mining, 112 10.1007/s10115-014-0797-0 10.1109/DASFAA.2001.916361 10.1109/ICISIP.2004.1287631 10.1007/3-540-36175-8_56 10.1007/978-3-642-37456-2_14 10.1145/2733381 10.1007/11731139_22 10.14778/3021924.3021932 10.1109/SC.2014.51 10.1145/1645953.1646038 10.1109/ICDE.1998.655795 10.1109/ICPADS.2011.83 10.1109/CLOUD.2012.42 10.1145/2503210.2503262 10.1016/j.procs.2013.05.200 10.1145/3132847.3133112 10.1145/3183713.3196887 10.1007/978-3-540-30116-5_23 10.1109/BigData.2018.8622258