An incremental algorithm for clustering spatial data streams: exploring temporal locality

Knowledge and Information Systems - Tập 37 - Trang 453-483 - 2013
Ling-Yin Wei1,2, Wen-Chih Peng1
1Department of Computer Science, National Chiao Tung University, Hsinchu, Taiwan
2Institute of Information Science, Academia Sinica, Taipei, Taiwan

Tóm tắt

Clustering sensor data discovers useful information hidden in sensor networks. In sensor networks, a sensor has two types of attributes: a geographic attribute (i.e, its spatial location) and non-geographic attributes (e.g., sensed readings). Sensor data are periodically collected and viewed as spatial data streams, where a spatial data stream consists of a sequence of data points exhibiting attributes in both the geographic and non-geographic domains. Previous studies have developed a dual clustering problem for spatial data by considering similarity-connected relationships in both geographic and non-geographic domains. However, the clustering processes in stream environments are time-sensitive because of frequently updated sensor data. For sensor data, the readings from one sensor are similar for a period, and the readings refer to temporal locality features. Using the temporal locality features of the sensor data, this study proposes an incremental clustering (IC) algorithm to discover clusters efficiently. The IC algorithm comprises two phases: cluster prediction and cluster refinement. The first phase estimates the probability of two sensors belonging to a cluster from the previous clustering results. According to the estimation, a coarse clustering result is derived. The cluster refinement phase then refines the coarse result. This study evaluates the performance of the IC algorithm using synthetic and real datasets. Experimental results show that the IC algorithm outperforms exiting approaches confirming the scalability of the IC algorithm. In addition, the effect of temporal locality features on the IC algorithm is analyzed and thoroughly examined in the experiments.

Tài liệu tham khảo

Aggarwal CC, Han J, Wang J, Yu PS (2003) A framework for clustering evolving data streams. In: Proceedings of the 29th international conference on very large data bases (VLDB), pp 81–92 Aggarwal CC, Yu PS (2010) On clustering massive text and categorical data streams. Knowl Inf Syst 24(2):171–196 Aghabozorgi SR, Saybani MR, Wah TY (2012) Incremental clustering of time-series by fuzzy clustering. J Inf Sci Eng (JISE) 28(4):671–688 Alïtelhadj A, Boughanem M, Mezghiche M, Souam F (2012) Using structural similarity for clustering xml documents. Knowl Inf Syst (KAIS) 32(1):109–139 Bagnall AJ, Janacek GJ (2004) Clustering time series from arma models with clipped data. In: Proceedings of the tenth ACM SIGKDD international conference on knowledge discovery and data mining (KDD), pp 49–58 Banerjee A, Ghosh J (2006) Scalable clustering algorithms with balancing constraints. Data Min Knowl Discov 13(3):365–395 Beringer J, HLullermeier E (2006) Online clustering of parallel data streams. Data Knowl Eng (DKE) 58(2):180–204 Borovkova S, Permana FJ (2004) Modelling electricity prices by the potential jumpdiffusion. In: Proceedings of the autumn school and international conference on stochastic finance (StochFin), pp 239–263 Cao F, Ester M, Qian W, Zhou A (2006) Density-based clustering over an evolving data stream with noise. In: Proceedings of the sixth SIAM international conference on data mining (SDM), pp 326–337 Cartea A, Figueroa MG (2005) Pricing in electricity markets a mean reverting jump diffusion model with seasonality. Appl Math Finance 12(4):313–335 Costa G, Manco G, Ortale R (2010) Density-based clustering of data streams at multiple resolutions. Data Min Knowl Discov (DMKD) 20(1):152–187 Dai BR, Lin CR, Chen MS (2007) Constrained data clustering by depth control and progressive constraint relaxation. Int J Very Large Data Bases 16(2):201–217 Davidson I, Ester M, Ravi SS (2007) Efficient incremental constrained clustering. In: Proceedings of the 13th ACM international conference on knowledge discovery and data mining (SIGKDD), pp 240–249 Davidson I, Ravi SS (2005) Clustering with constraints: Feasibility issues and the k-means algorithm. In: Proceedings of the fifth SIAM international conference on data mining (SDM), pp 138–149 Ester M, Kriegel HP, Sander J, Wimmer M, Xu X (1998) Incremental clustering for mining in a data warehousing environment. In: Proceedings of the 24th international conference on very large data bases (VLDB), pp 323–333 Ge R, Ester M, Gao BJ, Hu Z, Bhattacharya B, Ben-Moshe B (2008) Joint cluster analysis of attribute data and relationship data: the connected k-center problem, algorithms and applications. ACM Trans Knowl Discov Data 2(2):7:1–7:35 Ge R, Ester M, Jin W, Davidson I (2007) Constraint-driven clustering. In: Proceedings of the 13th ACM international conference on knowledge discovery and data mining(SIGKDD), pp 320–329 Guha S, Meyerson A, Mishra N, Motwani R, OCallaghan L (2003) Clustering data streams: theory and practice. IEEE Trans Knowl Data Eng (TKDE) 15(3):515–528 Guo L, Ai C, Wang X, Cai Z, Li Y (2009) Real time clustering of sensory data in wireless sensor networks. In: Proceedings of the 28th IEEE international performance computing and communications conference (IPCCC), pp 33–40 Halkidi M, Spiliopoulou M, Pavlou A (2012) A semi-supervised incremental clustering algorithm for streaming data. In: Proceedings of the 16th Pacific-Asia conference on advances in knowledge discovery and data mining (PAKDD), pp 578–590 Han J, Kamber M (2000) Data mining: concepts and techniques. Morgan Kaufmann, New York Huang J, Zhang J (2010) Distributed dual cluster algorithm based on grid for sensor streams. Int J Digit Content Technol Appl (JDCTA) 4(9):225–233 Kavitha V, Punithavalli M (2010) Clustering time series data stream—a literature survey. ACM Trans Knowl Discov Data (IJCSIS) 8(1):289–294 Klein D, Kamvar SD, Manning CD (2002) From instance-level constraints to space-level constraints: making the most of prior knowledge in data clustering. In: Proceedings of the 19th international conference on machine learning (ICML), pp 307–314 Liao ZX, Peng WC (2012) Clustering spatial data with a geographic constraint: exploring local search. Knowl Inf Syst 31(1):153–170 Lin CR, Liu KH, Chen MS (2005) Dual clustering: integrating data clustering over optimization and constraint domains. IEEE Trans Knowl Data Eng 17(5):628–637 Lin J, Vlachos M, Keogh EJ, Gunopulos D (2004) Iterative incremental clustering of time series. In: Proceedings of the ninth international conference on extending database technology (EDBT), pp 106V122 Lühr S, Lazarescu M (2009) Incremental clustering of dynamic data streams using connectivity based representative points. Data Knowl Eng (DKE) 68(1):1–27 Mirkin B (2011) Clustering for data mining: a data recovery approach. Taylor and Francis, London O’Callaghan L, Meyerson A, Motwani R, Mishra N, Guha S (2002) Streaming-data algorithms for high-quality clustering. In: Proceedings of the 18th IEEE international conference on data engineering (ICDE), pp 685–694 Pensa RG, Ienco D, Meo R (2012) Hierarchical co-clustering: off-line and incremental approaches. Data Mining Knowl Discov (DMKD). doi:10.1007/s10618-012-0292-8 Robert CP, Casella G (2004) Monte carlo statistical methods. Springer, New York Rodrigues PP, Gama J, Lopes L (2008) Clustering distributed sensor data streams. In: Proceedings of the European conference on machine learning and knowledge discovery in databases—part II (ECML PKDD), pp 282–297 Rodrigues PP, Gama J, Pedroso JP (2008) Hierarchical clustering of time series data streams. IEEE Trans Knowl Data Eng (TKDE) 20(5):615–627 Shi YB, Yuan CA, Huang Y, Wen YG (2010) A method of spatial clustering based on the combination of the spatial coordinate and attributes. In: Proceedings of the sixth international conference on information systems security (ICISS), pp 526–529 Shi Y, Zhang L (2010) Coid: A clustervoutlier iterative detection approach to multi-dimensional data analysis. Knowl Inf Sys. doi:10.1007/s10115-010-0323-y Tai CH, Dai BR, Chen MS (2007) Incremental clustering in geography and optimization spaces. In: Proceedings of the 11th Pacific-Asia conference on knowledge discovery and data mining (PAKDD), pp 272–283 Taiwan area national freeway bureau. http://www.freeway.gov.tw/ Tan PN, Steinbach M, Kumar V (2006) Introduction to data mining. Addison-Wesley, New York Wagstaff K, Cardie C (2000) Clustering with instance-level constraints. In: Proceedings of the 17th international conference on machine learning (ICML), pp 1103–1110 Wan L, Ng WK, Dang XH, Yu PS, Zhang K (2009) Density-based clustering of data streams at multiple resolutions. ACM Trans Knowl Discov Data (TKDD) 3(3):1–28 Wang JW, Cheng CH (2007) An efficient method for estimating null values in relational databases. Knowl Inf Syst 12(3):379–394 Wei LY, Peng WC (2009) Clustering data streams in optimization and geography domains. In: Proceedings of the 13th Pacific-Asia conference on knowledge discovery and data mining (PAKDD), pp 997–1005 Yang J (2003) Dynamic clustering of evolving streams with a single pass. In: Proceedings of the 19th IEEE international conference on data engineering (ICDE), pp 695–697 Zhou J, Guan J, Li P (2007c) Dcad: a dual clustering algorithm for distributed spatial databases. Geo-Spatial Inf Sci 10(2):137–144 Zhou A, Cao F, Yan Y, Sha C, He X (2007) Density-based clustering for real-time stream data. In: Proceedings of the 13th ACM SIGKDD international conference on knowledge discovery and data mining (KDD), pp 133–142 Zhou A, Cao F, Yan Y, Sha C, He X (2007) Distributed data stream clustering: a fast em-based approach. In: Proceedings of the 23rd international conference on data engineering (ICDE), pp 736–745