Continuous Outlier Monitoring on Uncertain Data Streams
Abstract
Outlier detection on data streams is an important task in data mining, and it becomes more challenging when the data are uncertain. This paper studies outlier detection on uncertain data streams. We propose Continuous Uncertain Outlier Detection (CUOD), which uses pruning to quickly determine the nature of uncertain elements and thus improve efficiency. We further propose a pruning approach, Probability Pruning for Continuous Uncertain Outlier Detection (PCUOD), which estimates outlier probabilities and thereby reduces the amount of computation and the detection cost. The cost of the incremental PCUOD algorithm is low enough to meet the demands of uncertain data streams. Finally, we propose a method for handling parameter-variable queries in CUOD, which enables different queries to execute concurrently. To the best of our knowledge, this is the first work on outlier detection over uncertain data streams that can handle parameter-variable queries simultaneously. We verify our methods on both real and synthetic data; the results show that they reduce the required storage and running time.
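The abstract does not spell out how outlier probabilities are estimated or pruned, so the following is only an illustrative sketch under stated assumptions, not the CUOD or PCUOD algorithm itself. It assumes a common distance-based model: each element in the window carries an existence probability, and an element is an outlier if fewer than `k` of its neighbors within distance `R` actually exist. The names `outlier_probability`, `is_outlier`, `R`, `k`, and `tau` are hypothetical. The pruning idea shown is generic: the probability of having fewer than `k` existing neighbors can only shrink as more candidate neighbors are taken into account, so the computation can stop as soon as it drops to the threshold.

```python
from typing import List, Sequence, Tuple

# Hypothetical representation: (value, existence probability).
Element = Tuple[float, float]


def outlier_probability(neighbor_probs: Sequence[float], k: int,
                        threshold: float = 0.0) -> float:
    """Return P(fewer than k neighbors exist), with k >= 1.

    If the running value drops to `threshold` or below, stop early and
    return it; in that case the result is only an upper bound.
    """
    # dist[j] = probability that exactly j of the neighbors seen so far exist,
    # tracked only for j < k (mass for j >= k is discarded).
    dist = [1.0] + [0.0] * (k - 1)
    for p in neighbor_probs:
        for j in range(k - 1, 0, -1):
            dist[j] = dist[j] * (1.0 - p) + dist[j - 1] * p
        dist[0] *= 1.0 - p
        if sum(dist) <= threshold:
            break  # pruning: further neighbors can only lower the value
    return sum(dist)


def is_outlier(window: List[Element], idx: int, R: float, k: int,
               tau: float) -> bool:
    """Assumed rule: outlier if P(fewer than k neighbors exist) > tau."""
    value, _ = window[idx]
    probs = [p for i, (v, p) in enumerate(window)
             if i != idx and abs(v - value) <= R]
    return outlier_probability(probs, k, threshold=tau) > tau


if __name__ == "__main__":
    window = [(1.0, 0.9), (1.1, 0.8), (1.2, 0.5), (9.0, 0.9)]
    for i in range(len(window)):
        print(i, is_outlier(window, i, R=0.5, k=2, tau=0.5))
```

The sketch conditions on the examined element itself existing and recomputes each decision from scratch; an incremental, sliding-window version of the kind the abstract describes would additionally have to update these probabilities as elements arrive and expire.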