Incremental Density-Based Clustering on Multicore Processors
IEEE Transactions on Pattern Analysis and Machine Intelligence - Tập 44 Số 3 - Trang 1338-1356 - 2022
Tóm tắt
The density-based clustering algorithm is a fundamental data clustering technique with many real-world applications. However, when the database is frequently changed, how to effectively update clustering results rather than reclustering from scratch remains a challenging task. In this work, we introduce IncAnyDBC, a unique parallel incremental data clustering approach to deal with this problem. First, IncAnyDBC can process changes in bulks rather than batches like state-of-the-art methods for reducing update overheads. Second, it keeps an underlying cluster structure called the object node graph during the clustering process and uses it as a basis for incrementally updating clusters wrt. inserted or deleted objects in the database by propagating changes around affected nodes only. In additional, IncAnyDBC actively and iteratively examines the graph and chooses only a small set of most meaningful objects to produce exact clustering results of DBSCAN or to approximate results under arbitrary time constraints. This makes it more efficient than other existing methods. Third, by processing objects in blocks, IncAnyDBC can be efficiently parallelized on multicore CPUs, thus creating a work-efficient method. It runs much faster than existing techniques using one thread while still scaling well with multiple threads. Experiments are conducted on various large real datasets for demonstrating the performance of IncAnyDBC.
Từ khóa
#Density-based clustering #anytime clustering #incremental clustering #active clustering #multicore CPUsTài liệu tham khảo
Aggarwal, 2014, Data Clustering: Algorithms and Applications, 10.1201/b17320
Han, 2012, Data Mining: Concepts and Techniques
Ester, Incremental clustering for mining in a data warehousing environment, Proc. 24rd Int. Conf. Very Large Data Bases, 323
10.1007/978-3-319-56608-5_50
10.1145/2505515.2507837
10.1145/3035918.3064050
10.5555/3001460.3001507
10.1145/1247480.1247546
10.1109/ICDM.2012.95
10.1016/j.pmcj.2009.07.005
10.1145/3083897
10.1145/2723372.2737792
Gunawan, 2013, A faster algorithm for DBSCAN
10.1145/2939672.2939750
Frank, 2019, UCI machine learning repository
10.1111/j.1365-2966.2006.11287.x
10.1007/s10618-018-0562-1
10.1145/3068335
10.1007/s10115-016-1004-2
10.1109/SC.2012.9
Tan, 2005, Introduction to Data Mining, (First Edition)
10.1145/2834892.2834894
10.1145/258533.258657
Ackerman, Incremental clustering: The case for extra clusters, Proc. 27th Int. Conf. Neural Inf. Process. Syst., 307
10.1007/978-3-540-24741-8_8
10.1109/ICDE.2017.94
10.1109/TKDE.2018.2828086
10.1137/1.9781611972764.29
10.1145/1552303.1552307
10.1007/978-3-319-55705-2_10
10.1007/978-3-319-44039-2_3
10.1109/ICDM.2004.10082
Böhm, Efficient anytime density-based clustering, Proc. SIAM Int. Conf. Data Mining, 112
10.1007/s10115-014-0797-0
10.1109/DASFAA.2001.916361
10.1109/ICISIP.2004.1287631
10.1007/3-540-36175-8_56
10.1007/978-3-642-37456-2_14
10.1145/2733381
10.1007/11731139_22
10.14778/3021924.3021932
10.1109/SC.2014.51
10.1145/1645953.1646038
10.1109/ICDE.1998.655795
10.1109/ICPADS.2011.83
10.1109/CLOUD.2012.42
10.1145/2503210.2503262
10.1016/j.procs.2013.05.200
10.1145/3132847.3133112
10.1145/3183713.3196887
10.1007/978-3-540-30116-5_23
10.1109/BigData.2018.8622258