On Density-Based Data Streams Clustering Algorithms: A Survey
Tóm tắt
Clustering data streams has drawn lots of attention in the last few years due to their ever-growing presence. Data streams put additional challenges on clustering such as limited time and memory and one pass clustering. Furthermore, discovering clusters with arbitrary shapes is very important in data stream applications. Data streams are infinite and evolving over time, and we do not have any knowledge about the number of clusters. In a data stream environment due to various factors, some noise appears occasionally. Density-based method is a remarkable class in clustering data streams, which has the ability to discover arbitrary shape clusters and to detect noise. Furthermore, it does not need the number of clusters in advance. Due to data stream characteristics, the traditional density-based clustering is not applicable. Recently, a lot of density-based clustering algorithms are extended for data streams. The main idea in these algorithms is using density-based methods in the clustering process and at the same time overcoming the constraints, which are put out by data stream’s nature. The purpose of this paper is to shed light on some algorithms in the literature on density-based clustering over data streams. We not only summarize the main density-based clustering algorithms on data streams, discuss their uniqueness and limitations, but also explain how they address the challenges in clustering data streams. Moreover, we investigate the evaluation metrics used in validating cluster quality and measuring algorithms’ performance. It is hoped that this survey will serve as a steppingstone for researchers studying data streams clustering, particularly density-based algorithms.
Tài liệu tham khảo
citation_title=Data Mining: Concepts and Techniques (3rd edition); citation_publication_date=2011; citation_id=CR1; citation_author=J Han; citation_author=M Kamber; citation_author=J Pei; citation_publisher=Morgan Kaufmann Publishers Inc.
citation_journal_title=ACM SIGMOD Record; citation_title=Mining data streams: A review; citation_author=M Gaber, A Zaslavsky, S Krishnaswamy; citation_volume=34; citation_issue=2; citation_publication_date=2005; citation_pages=18-26; citation_doi=10.1145/1083784.1083789; citation_id=CR2
Cao F, Ester M, Qian W, Zhou A. Density-based clustering over an evolving data stream with noise. In Proc. the 2006 SIAM Conference on Data Mining, April 2006, pp. 328-339.
Chen Y, Tu L. Density-based clustering for real-time stream data. In Proc. the 13th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, Aug. 2007, pp. 133-142.
Aggarwal C C (ed.). Data Streams: Models and Algorithms. New York, USA: Springer, 2007.
Hahsler M, Dunham M H. Temporal structure learning for clustering massive data streams in real-time. In Proc. the 11th SIAM Conference on Data Mining, April 2011, pp. 664-675.
O Ćallaghan L, Mishra N, Meyerson A et al. Streaming-data algorithms for high-quality clustering. In Proc. the 18th Int. Conf. Data Engineering, Feb. 26-Mar. 1, 2002, pp. 685-694.
citation_journal_title=SIGKDD Explorations Newsletter; citation_title=Requirements for clustering data streams; citation_author=D Barbará; citation_volume=3; citation_issue=2; citation_publication_date=2002; citation_pages=23-27; citation_doi=10.1145/507515.507519; citation_id=CR8
citation_journal_title=IEEE Trans. Knowledge and Data Engineering; citation_title=Clustering data streams: Theory and practice; citation_author=S Guha, A Meyerson, N Mishra; citation_volume=15; citation_issue=3; citation_publication_date=2003; citation_pages=515-528; citation_doi=10.1109/TKDE.2003.1198387; citation_id=CR9
Aggarwal C C, Han J, Wang J, Yu P S. A framework for clustering evolving data streams. In Proc. the 29th International Conference on Very Large Data Bases, Sept. 2003, pp. 81-92.
Ackermann M R, Lammersen C, Märtens M, Raupach C, Sohler C, Swierkot K. StreamKM++: A clustering algorithm for data streams. In Proc. the 12th Workshop on Algorithm Engineering and Experiments, Jan. 2010, pp. 173-187.
Ikonomovska E, Loskovska S, Gjorgjevik D. A survey of stream data mining. In Proc. the 8th National Conference with International Participation, Sept. 2007, pp. 19-25.
Gaber M, Zaslavsky A, Krishnaswamy S. Data stream mining. Data Mining and Knowledge Discovery Handbook, 2010, pp. 759-787.
Babcock B, Babu S, Datar M, Motwani R, Widom J. Models and issues in data stream systems. In Proc. the 21st ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, June 2002, pp. 1-16.
Jain A K, Dubes R C. Algorithms for Clustering Data. Upper Saddle River, NJ, USA: Prentice-Hall, Inc., 1988.
citation_journal_title=Pattern Recognition Letter; citation_title=Data clustering: 50 years beyond K-means; citation_author=AK Jain; citation_volume=31; citation_issue=8; citation_publication_date=2010; citation_pages=651-666; citation_doi=10.1016/j.patrec.2009.09.011; citation_id=CR16
citation_journal_title=Int. J. Knowledge-Based and Intelligent Engineering Systems; citation_title=Clustering data stream: A survey of algorithms; citation_author=A Mahdiraji; citation_volume=13; citation_issue=2; citation_publication_date=2009; citation_pages=39-44; citation_id=CR17
Amini A, Wah T, Saybani M et al. A study of density-grid based clustering algorithms on data streams. In Proc. the 8th Int. Conf. Fuzzy Systems and Knowledge Discovery, July 2011, pp. 1652-1656.
Amini A, Wah T Y. Density micro-clustering algorithms on data streams: A review. In Proc. Int. Multiconf. Data Mining and Applications, March 2011, pp. 410-414.
Amini A, Wah T Y. A comparative study of density-based clustering algorithms on data streams: Micro-clustering approaches. In Lecture Notes in Electrical Engineering 110, Ao S, Castillo O, Huang X (eds.), Springer, 2012, pp. 275-287.
Aggarwal C C. A survey of stream clustering algorithms. In Data Clustering: Algorithms and Applications, Aggarwal C C, Reddy C (eds.), CRC Press, 2013, pp. 457-482.
citation_title=Data Mining: Concepts and Techniques (2nd edition). Morgan Kaufmann; citation_publication_date=2006; citation_id=CR22; citation_author=J Han; citation_author=M Kamber
MacQueen J. Some methods for classification and analysis of multivariate observations. In Proc. the 5th Berkeley Symposium on Mathematical Statistics and Probability, June 21-July 18, 1967, pp. 281-297.
citation_journal_title=IEEE Transactions on Information Theory; citation_title=Least squares quantization in PCM; citation_author=SP Lloyd; citation_volume=28; citation_issue=2; citation_publication_date=1982; citation_pages=129-137; citation_doi=10.1109/TIT.1982.1056489; citation_id=CR24
Guha S, Mishra N, Motwani R, O’Callaghan L. Clustering data streams. In Proc. the 41st Annual Symposium on Foundations of Computer Science, Nov. 2000, pp. 359-366.
Zhang T, Ramakrishnan R, Livny M. BIRCH: An efficient data clustering method for very large databases. In Proc. the 1996 ACM SIGMOD International Conference on Management of Data, June 1996, pp. 103-114.
citation_journal_title=Computer; citation_title=Chameleon: Hierarchical clustering using dynamic modeling; citation_author=G Karypis, E Han, V Kumar; citation_volume=32; citation_issue=8; citation_publication_date=1999; citation_pages=68-75; citation_doi=10.1109/2.781637; citation_id=CR27
citation_journal_title=Knowl. Inf. Syst.; citation_title=The clustree: Indexing micro-clusters for anytime stream mining; citation_author=P Kranen, I Assent, C Baldauf, T Seidl; citation_volume=29; citation_issue=2; citation_publication_date=2011; citation_pages=249-272; citation_doi=10.1007/s10115-010-0342-8; citation_id=CR28
Wang W, Yang J, Muntz R R. STING: A statistical information grid approach to spatial data mining. In Proc. the 23rd Int. Conf. Very Large Data Bases, Aug. 1997, pp. 186-195.
citation_journal_title=The VLDB Journal; citation_title=Wavecluster: A wavelet-based clustering approach for spatial data in very large databases; citation_author=G Sheikholeslami, S Chatterjee, A Zhang; citation_volume=8; citation_issue=3/4; citation_publication_date=2000; citation_pages=289-304; citation_doi=10.1007/s007780050009; citation_id=CR30
citation_journal_title=ACM SIGMOD Record; citation_title=Automatic subspace clustering of high dimensional data for data mining applications; citation_author=R Agrawal, J Gehrke, D Gunopulos, P Raghavan; citation_volume=27; citation_issue=2; citation_publication_date=1998; citation_pages=94-105; citation_doi=10.1145/276305.276314; citation_id=CR31
Tu L, Chen Y. Stream data clustering based on grid density and attraction. ACM Transactions on Knowledge Discovery Data, 2009, 3(3): Article No. 12.
Wan L, Ng W K, Dang X H et al. Density-based clustering of data streams at multiple resolutions. ACM Trans. Knowledge Discovery from Data, 2009, 3(3): Article No. 14.
citation_journal_title=Journal of the Royal Statistical Society, Series B; citation_title=Maximum likelihood from incomplete data via the EM algorithm; citation_author=AP Dempster, NM Laird, DB Rubin; citation_volume=39; citation_issue=1; citation_publication_date=1977; citation_pages=1-38; citation_id=CR34
Dang X, Lee V, Ng W K et al. An EM-based algorithm for clustering data streams in sliding windows. In Proc. the Int. Conf. Database Systems for Advanced Applications, Apr. 2009, pp. 230-235.
Ester M, Kriegel H, Sander J, Xu X. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proc. the 2nd International Conference on Knowledge Discovery and Data Mining, Aug. 1996, pp. 226-231.
citation_journal_title=ACM SIGMOD Record; citation_title=Optics: Ordering points to identify the clustering structure; citation_author=M Ankerst, MM Breunig, HP Kriegel, J Sander; citation_volume=28; citation_issue=2; citation_publication_date=1999; citation_pages=49-60; citation_doi=10.1145/304181.304187; citation_id=CR37
Hinneburg A, Keim D A. An efficient approach to clustering in large multimedia databases with noise. In Proc. the 4th KDD, Sept. 1998, pp. 58-65.
citation_title=Data stream mining: Basic methods and techniques; citation_publication_date=2012; citation_id=CR39; citation_author=M Matysiak; citation_publisher=RWTH Aachen University
citation_journal_title=Knowledge and Information Systems; citation_title=Tracking clusters in evolving data streams over sliding windows; citation_author=A Zhou, F Cao, W Qian, C Jin; citation_volume=15; citation_issue=2; citation_publication_date=2008; citation_pages=181-214; citation_doi=10.1007/s10115-007-0070-x; citation_id=CR40
Ren J, Ma R. Density-based data streams clustering over sliding windows. In Proc. the 6th Int. Conf. Fuzzy systems and Knowledge Discovery, Aug. 2009, pp. 248-252.
Charikar M, O’Callaghan L, Panigrahy R. Better streaming algorithms for clustering problems. In Proc. the 35th Annual ACM Symp. Theory of Computing, June 2003, pp. 30-39.
Gao J, Li J, Zhang Z, Tan P N. An incremental data stream clustering algorithm based on dense units detection. In Proc. the 9th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, May 2005, pp. 420-425.
Aggarwal C C, Han J, Wang J, Yu P S. A framework for projected clustering of high dimensional data streams. In Proc. the 30th International Conference on Very Large Data Bases, Volume 30, Aug. 29-Sept. 3, 2004, pp. 852-863.
citation_journal_title=Data Mining and Knowledge Discovery; citation_title=On high dimensional projected clustering of data streams; citation_author=CC Aggarwal, J Han, J Wang, PS Yu; citation_volume=10; citation_issue=3; citation_publication_date=2005; citation_pages=251-273; citation_doi=10.1007/s10618-005-0645-7; citation_id=CR45
Babcock B, Datar M, Motwani R, O’Callaghan L. Maintaining variance and k-medians over data stream windows. In Proc. the 22nd ACM SIGMOD-SIGACT-SIGART Symp. Principles of Database Systems, June 2003, pp. 234-243.
Ng W, Dash M. Discovery of frequent patterns in transactional data streams. In Lecture Notes in Computer Science 6380, Hameurlain A, Küng J, Wagner R et al. (eds.), Springer Berlin/Heidelberg, 2010, pp. 1-30.
citation_journal_title=ACM Trans. Math. Softw.; citation_title=Random sampling with a reservoir; citation_author=JS Vitter; citation_volume=11; citation_issue=1; citation_publication_date=1985; citation_pages=37-57; citation_doi=10.1145/3147.3165; citation_id=CR48
Garofalakis M, Gehrke J, Rastogi R. Querying and mining data streams: You only get one look: A tutorial. In Proc. the 2002 ACM SIGMOD Int. Conf. Management of Data, June 2002, pp. 635-635.
Aggarwal C C, Yu P S. A survey of synopsis construction in data streams. In Advances in Database Systems 31, Aggarwal C C (ed.), Springer, 2007, pp. 169-207.
Garofalakis M N. Wavelets on streams. In Encyclopedia of Database Systems, Springer US, 2009, pp. 3446-3451.
citation_journal_title=IEEE Trans. Knowl. and Data Eng.; citation_title=One-pass wavelet decompositions of data streams; citation_author=AC Gilbert, Y Kotidis, S Muthukrishnan, MJ Strauss; citation_volume=15; citation_issue=3; citation_publication_date=2003; citation_pages=541-554; citation_doi=10.1109/TKDE.2003.1198389; citation_id=CR52
Gama J, Gaber M M (eds.). Learning from Data Streams - Processing Techniques in Sensor Networks. Springer, 2007.
citation_journal_title=SIGKDD Explorations Newsletter; citation_title=KDD-cup 99: Knowledge discovery in a charitable organization’s donor database; citation_author=S Rosset, A Inger; citation_volume=1; citation_issue=2; citation_publication_date=2000; citation_pages=85-90; citation_doi=10.1145/846183.846204; citation_id=CR54
citation_journal_title=Psychological Bulletin; citation_title=A general statistical framework for assessing categorical clustering in free recall; citation_author=LJ Hubert, JR Levin; citation_volume=83; citation_issue=6; citation_publication_date=1976; citation_pages=1072-1080; citation_doi=10.1037/0033-2909.83.6.1072; citation_id=CR55
Kaufman L, Rousseeuw P J. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley-Interscience, 2005.
Wu J, Xiong H, Chen J. Adapting the right measures for K-means clustering. In Proc. the 15th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, June 2009, pp. 877-886.
citation_journal_title=Journal of the American Statistical Association; citation_title=Objective criteria for the evaluation of clustering methods; citation_author=WM Rand; citation_volume=66; citation_issue=336; citation_publication_date=1971; citation_pages=846-850; citation_doi=10.1080/01621459.1971.10482356; citation_id=CR58
citation_journal_title=Machine Learning; citation_title=Empirical and theoretical comparisons of selected criterion functions for document clustering; citation_author=Y Zhao, G Karypis; citation_volume=55; citation_issue=3; citation_publication_date=2004; citation_pages=311-331; citation_doi=10.1023/B:MACH.0000027785.44527.d6; citation_id=CR59
citation_title=Performance criteria for graph clustering and Markov cluster experiments; citation_publication_date=2000; citation_id=CR60; citation_author=S Dongen; citation_publisher=Technical Report
Rosenberg A, Hirschberg J. V-measure: A conditional entropy-based external cluster evaluation measure. In Proc. the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, June 2007, pp. 410-420.
Meilǎ M. Comparing clusterings: An axiomatic view. In Proc. the 22nd Int. Conf. Machine Learning, Aug. 2005, pp. 577-584.
Rijsbergen C J V. Information Retrieval. Newton, MA, USA: Butterworth-Heinemann, 1979.
citation_journal_title=Psychometrika; citation_title=A Monte Carlo study of thirty internal criterion measures for cluster analysis; citation_author=G Milligan; citation_volume=46; citation_issue=2; citation_publication_date=1981; citation_pages=187-199; citation_doi=10.1007/BF02293899; citation_id=CR64
Pereira C M M, de Mello R F. A comparison of clustering algorithms for data streams. In Proc. the 1st Int. Conf. Integrated Comp. Tech., May 31-June 2, 2011, pp. 59-74.
Manning C D, Raghavan P, Schtze H. Introduction to Information Retrieval. New York, NY, USA: Cambridge University Press, 2008.
citation_journal_title=Data Mining and Knowledge Discovery; citation_title=A single pass algorithm for clustering evolving data streams based on swarm intelligence; citation_author=A Forestiero, C Pizzuti, G Spezzano; citation_volume=26; citation_issue=1; citation_publication_date=2013; citation_pages=1-26; citation_doi=10.1007/s10618-011-0242-x; citation_id=CR67
citation_journal_title=Journal of Machine Learning Research; citation_title=MOA: Massive online analysis, a framework for stream classification and clustering; citation_author=A Bifet, G Holmes, B Pfahringer; citation_volume=11; citation_publication_date=2010; citation_pages=44-50; citation_id=CR68
Holmes G, Donkin A, Witten I H. WEKA: A machine learning workbench. In Proc. the 2nd Australian and New Zealand Conference on Intelligent Information Systems, Nov. 29-Dec. 3, 1994, pp. 357-361.
Kranen P, Kremer H, Jansen T et al. Clustering performance on evolving data streams: Assessing algorithms and evaluation measures within MOA. In Proc. the IEEE Int. Conf. Data Mining Workshops, Dec. 2010, pp. 1400-1403.
Kremer H, Kranen P, Jansen T et al. An effective evaluation measure for clustering on evolving data streams. In Proc. the 17th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, July 2011, pp. 868-876.
De Francisci Morales G. SAMOA: A platform for mining big data streams. In Proc. the 22nd Int. Conf. World Wide Web Companion, May 2013, pp. 777-778.
Tasoulis D K, Ross G, Adams N M. Visualising the cluster structure of data streams. In Proc. the 7th International Conference on Intelligent Data Analysis, Sept. 2007, pp. 81-92.
Ruiz C, Menasalvas E, Spiliopoulou M. C-DenStream: Using domain knowledge on a data stream. In Proc. the 12th International Conference on Discovery Science, Oct. 2009, pp. 287-301.
Liu L, Jing K, Guo Y et al. A three-step clustering algorithm over an evolving data stream. In Proc. the IEEE Int. Conf. Intelligent Computing and Intelligent Systems, Nov. 2009, pp. 160-164.
Lin J, Lin H. A density-based clustering over evolving heterogeneous data stream. In Proc. the 2nd Int. Colloquium on Computing, Communication, Control, and Management, Aug. 2009, pp. 275-277.
Isaksson C, Dunham M, Hahsler M. SOStream: Self organizing density-based clustering over data stream. In Lecture Notes in Computer Science 7376, Perner P (ed.), Springer Berlin Heidelberg, 2012, pp. 264-278.
Ntoutsi I, Zimek A, Palpanas T et al. Density-based projected clustering over high dimensional data streams. In Proc. the 12th SIAM Int. Conf. Data Mining, April 2012, pp. 987-998.
Hassani M, Spaus P, Gaber M M, Seidl T. Density-based projected clustering of data streams. In Proc. the 6th Int. Conf. Scalable Uncertainty Management, Sept. 2012, pp. 311-324.
Jia C, Tan C, Yong A. A grid and density-based clustering algorithm for processing data stream. In Proc. the 2nd Int. Conf. Genetic and Evolutionary Computing, Sept. 2008, pp. 517-521.
citation_journal_title=Journal of Convergence Information Technology; citation_title=Clustering over data streams based on grid density and index tree; citation_author=J Ren, B Cai, C Hu; citation_volume=6; citation_issue=1; citation_publication_date=2011; citation_pages=83-93; citation_doi=10.4156/jcit.vol6.issue1.11; citation_id=CR81
Yang Y, Liu Z, Zhang J et al. Dynamic density-based clustering algorithm over uncertain data streams. In Proc. the 9th Int. Conf. Fuzzy Systems and Knowledge Discovery, May 2012, pp. 2664-2670.
Amini A, Teh Ying W. DENGRIS-Stream: A density-grid based clustering algorithm for evolving data streams over sliding window. In Proc. International Conference on Data Mining and Computer Engineering, Dec. 2012, pp. 206-210.
citation_title=Clustering data streams using grid-based synopsis; citation_publication_date=2013; citation_id=CR84; citation_author=V Bhatnagar; citation_author=S Kaur; citation_author=S Chakravarthy; citation_publisher=Knowledge and Information Systems
Ruiz C, Spiliopoulou M, Menasalvas E. C-DBSCAN: Density-based clustering with constraints. In Proc. the 11th International Conference on Rough Sets, Fuzzy Sets, Data Mining and Granular Computing, May 2007, pp. 216-223.
Yang C, Zhou J. HClustream: A novel approach for clustering evolving heterogeneous data stream. In Proc. the 6th IEEE Int. Conf. Data Mining Workshops, Dec. 2006, pp. 682-688.
citation_journal_title=Biological Cybernetics; citation_title=Self-organized formation of topologically correct feature maps; citation_author=T Kohonen; citation_volume=43; citation_issue=1; citation_publication_date=1982; citation_pages=59-69; citation_doi=10.1007/BF00337288; citation_id=CR87
Bohm C, Kailing K, Kriegel H P, Kroger P. Density connected clustering with local subspace preferences. In Proc. the 4th IEEE Int. Conf. Data Mining, Nov. 2004, pp. 27-34.
Kennedy J F, Kennedy J, Eberhart R C. Swarm Intelligence. Morgan Kaufmann Pub, 2001.
citation_journal_title=Engineering Applications of Artificial Intelligence; citation_title=An appraisal and design of a multi-agent system based cooperative wireless intrusion detection computational intelligence technique; citation_author=S Shamshirband, N Anuar, M Kiah; citation_volume=26; citation_issue=9; citation_publication_date=2013; citation_pages=2105-2127; citation_doi=10.1016/j.engappai.2013.04.010; citation_id=CR90
citation_journal_title=Data Mining and Knowledge Discovery; citation_title=Density-based clustering in spatial databases: The algorithm GDBSCAN and its applications; citation_author=J Sander, M Ester, HP Kriegel, X Xu; citation_volume=2; citation_issue=2; citation_publication_date=1998; citation_pages=169-194; citation_doi=10.1023/A:1009745219419; citation_id=CR91
citation_journal_title=NeuroImage; citation_title=Automated detection of brain atrophy patterns based on MRI for the prediction of Alzheimer’s disease; citation_author=C Plant, SJ Teipel, A Oswald; citation_volume=50; citation_issue=1; citation_publication_date=2010; citation_pages=162-174; citation_doi=10.1016/j.neuroimage.2009.11.046; citation_id=CR92
citation_journal_title=Computerized Medical Imaging and Graphics; citation_title=Fast density-based lesion detection in dermoscopy images; citation_author=M Mete, S Kockara, K Aydin; citation_volume=35; citation_issue=2; citation_publication_date=2011; citation_pages=128-136; citation_doi=10.1016/j.compmedimag.2010.07.007; citation_id=CR93
citation_journal_title=Proc. VLDB Endow.; citation_title=Summarization and matching of density-based clusters in streaming environments; citation_author=D Yang, EA Rundensteiner, MO Ward; citation_volume=5; citation_issue=2; citation_publication_date=2011; citation_pages=121-132; citation_id=CR94
citation_journal_title=Expert Systems with Applications; citation_title=Mining spatio-temporal information on microblogging streams using a density-based online clustering method; citation_author=CH Lee; citation_volume=39; citation_issue=10; citation_publication_date=2012; citation_pages=9623-9641; citation_doi=10.1016/j.eswa.2012.02.136; citation_id=CR95
citation_journal_title=Computer Science and Information Systems; citation_title=Online clustering for trajectory data stream of moving objects; citation_author=Y Yu, Q Wang, X Wang, H Wang, J He; citation_volume=10; citation_issue=3; citation_publication_date=2013; citation_pages=1293-1317; citation_doi=10.2298/CSIS120723049Y; citation_id=CR96