Fast and Robust General Purpose Clustering Algorithms

Data Mining and Knowledge Discovery - Volume 8 - Pages 127–150 - 2004
V. Estivill-Castro1, J. Yang2
1School of Computing and Information Technology, Griffith University, Nathan, Australia
2School of Computing and Information Technology, University of Western Sydney, Campbelltown, Australia

Abstract

General-purpose and widely applicable clustering methods are usually required in the early stages of knowledge discovery exercises. k-MEANS has been adopted as the prototype of iterative, model-based clustering because of its speed, its simplicity, and its capability to work within the format of very large databases. However, k-MEANS has several drawbacks that stem from its statistical simplicity. We propose an algorithm that remains very efficient, generally applicable, and multidimensional, but is more robust to noise and outliers. We achieve this by using medians, rather than means, as estimators of the cluster centers. Comparison with k-MEANS and sampled EXPECTATION MAXIMIZATION demonstrates the advantages of our algorithm.
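The core idea of the abstract can be seen in a single Lloyd-style update step: after assigning points to their nearest center, each center is re-estimated coordinate-wise with the median instead of the mean, which keeps it from being dragged toward outliers. The sketch below is a minimal illustration of that idea, not the authors' exact algorithm; all function names are our own.

```python
# Sketch: one k-MEANS-style iteration where the center update uses the
# coordinate-wise median instead of the mean (illustrative only).
from statistics import mean, median

def assign(points, centers):
    """Assign each point to the index of its nearest center (squared Euclidean)."""
    def nearest(p):
        return min(range(len(centers)),
                   key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
    return [nearest(p) for p in points]

def update_centers(points, labels, k, estimator):
    """Re-estimate each cluster center coordinate-wise with the given estimator."""
    centers = []
    for j in range(k):
        members = [p for p, lbl in zip(points, labels) if lbl == j]
        centers.append(tuple(estimator(coord) for coord in zip(*members)))
    return centers

# A tight cluster near the origin plus a single far-away outlier.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (100.0, 100.0)]
labels = assign(points, [(0.0, 0.0)])  # k = 1: everything in one cluster

center_mean = update_centers(points, labels, 1, mean)[0]
center_median = update_centers(points, labels, 1, median)[0]
print(center_mean)    # (20.4, 20.4): pulled far toward the outlier
print(center_median)  # (1.0, 1.0): stays with the bulk of the data
```

The mean-based center lands far from every genuine data point, while the median-based center remains inside the main cluster; this is the robustness-to-outliers property the abstract claims for the median estimator.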

Keywords

#clustering #k-MEANS algorithm #medians #noise #outliers

References

Aldenderfer, M.S. and Blashfield, R.K. 1984. Cluster Analysis. Beverly Hills, USA: Sage Publications.
Apostol, T.M. 1969. Calculus—Volume II, second edition. NY, USA: John Wiley & Sons.
Arnold, S.F. 1993. Gibbs sampling. In Handbook of Statistics 9, C.R. Rao (Ed.), Amsterdam: North Holland, pp. 599–625.
Arora, S., Raghavan, P., and Rao, S. 1998. Approximation schemes for Euclidean k-medians and related problems. In Proceedings of the 30th Annual ACM Symposium on the Theory of Computing (STOC), Dallas, TX, ACM Press, pp. 106–113.
Bajaj, C. 1986. Proving geometric algorithm non-solvability: An application of factoring polynomials. Journal of Symbolic Computation, 2:99–102.
Banfield, J.D. and Raftery, A.E. 1993. Model-based Gaussian and non-Gaussian clustering. Biometrics, 49:803–821.
Berry, M.J.A. and Linoff, G. 1997. Data Mining Techniques—For Marketing, Sales and Customer Support. NY, USA: John Wiley & Sons.
Bezdek, J.C. 1981. Pattern Recognition with Fuzzy Objective Function Algorithms. New York: Plenum Press.
Bradley, P.S. and Fayyad, U. 1998. Refining the initial points in k-means clustering. In Proceedings of the Fifteenth International Conference on Machine Learning, San Mateo, CA, Morgan Kaufmann Publishers, pp. 91–99.
Bradley, P.S., Fayyad, U., and Reina, C. 1998. Scaling clustering algorithms to large databases. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, R. Agrawal and P. Stolorz (Eds.), AAAI Press, pp. 9–15.
Bradley, P.S., Mangasarian, O.L., and Street, W.N. 1997. Clustering via concave minimization. Advances in Neural Information Processing Systems, 9:368–374.
Bulmer, M.G. 1979. Principles of Statistics, second edition. NY: Dover.
Casella, G. and Berger, R.L. 1990. Statistical Inference. Belmont, CA: Wadsworth & Brooks/Cole.
Celeux, G. and Govaert, G. 1995. Gaussian parsimonious clustering models. Pattern Recognition, 28(5):781–793.
Cherkassky, V. and Mulier, F. 1998. Learning from Data—Concepts, Theory and Methods. NY, USA: John Wiley & Sons.
Dempster, A.P., Laird, N.M., and Rubin, D.B. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39:1–38.
Dowe, D., Baxter, R.A., Oliver, J.J., and Wallace, C. 1998. Point estimation using the Kullback-Leibler loss function and MML. In Proceedings of the Second Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD-98), X. Wu, R. Kotagiri, and K.K. Korb (Eds.), Melbourne, Australia, Springer-Verlag Lecture Notes in Artificial Intelligence 1394, pp. 87–95.
Duda, R.O. and Hart, P.E. 1973. Pattern Classification and Scene Analysis. NY, USA: John Wiley & Sons.
Estivill-Castro, V. 2002. Why so many clustering algorithms—A position paper. SIGKDD Explorations, 4(1):65–75.
Estivill-Castro, V. and Houle, M.E. 2001. Robust distance-based clustering with applications to spatial data mining. Algorithmica, 30(2):216–242.
Estivill-Castro, V. and Murray, A.T. 1998. Discovering associations in spatial data—An efficient medoid based approach. In Proceedings of the 2nd Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD-98), X. Wu, R. Kotagiri, and K.K. Korb (Eds.), Melbourne, Australia, Springer-Verlag Lecture Notes in Artificial Intelligence 1394, pp. 110–121.
Everitt, B. 1980. Cluster Analysis, 2nd edition. New York, USA: Halsted Press.
Fayyad, U., Reina, C., and Bradley, P.S. 1998. Initialization of iterative refinement clustering algorithms. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, R. Agrawal and P. Stolorz (Eds.), AAAI Press, pp. 194–198.
Fraley, C. and Raftery, A.E. 1998. How many clusters? Which clustering method? Answers via model-based cluster analysis. Computer Journal, 41(8):578–588.
Francis, R.L. 1974. Facility Layout and Location: An Analytical Approach. Englewood Cliffs, NJ: Prentice-Hall.
Gelman, A., Carlin, J.B., Stern, H.S., and Rubin, D.B. 1995. Bayesian Data Analysis. London: Chapman & Hall.
Graham, R.L., Knuth, D.E., and Patashnik, O. 1989. Concrete Mathematics. Reading, MA: Addison-Wesley Publishing Co.
Gupta, S.K., Rao, K.S., and Bhatnagar, V. 1999. K-means clustering algorithm for categorical attributes. In Data Warehousing and Knowledge Discovery (DaWaK-99), M. Mohania and A.M. Tjoa (Eds.), Florence, Italy, Springer-Verlag Lecture Notes in Computer Science 1676, pp. 203–208.
Huang, Z. 1998. Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery, 2(3):283–304.
Jordan, M.I. and Jacobs, R.I. 1993. Supervised learning and divide-and-conquer: A statistical approach. In Proceedings of the Tenth International Conference on Machine Learning, San Mateo, CA, Morgan Kaufmann Publishers, pp. 159–166.
Kaufman, L. and Rousseeuw, P.J. 1990. Finding Groups in Data: An Introduction to Cluster Analysis. NY, USA: John Wiley & Sons.
Kuhn, H.W. 1967. On a pair of dual nonlinear problems. In Nonlinear Programming, J. Abadie and S. Vajda (Eds.), Chapter 3, NY, USA: John Wiley & Sons.
Kuhn, H.W. 1973. A note on Fermat's problem. Mathematical Programming, 4(1):98–107.
Kuhn, H.W. and Kuenne, R.E. 1962. An efficient algorithm for the numerical solution of the generalized Weber problem in spatial economics. Journal of Regional Science, 4(2):21–33.
MacQueen, J. 1967. Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, L. Le Cam and J. Neyman (Eds.), Volume 1, pp. 281–297.
Massa, S., Paolucci, M., and Puliafito, P.P. 1999. A new modelling technique based on Markov chains to mine behavioral patterns in event based time series. In Data Warehousing and Knowledge Discovery (DaWaK-99), M. Mohania and A.M. Tjoa (Eds.), Florence, Italy, Springer-Verlag Lecture Notes in Computer Science 1676, pp. 331–342.
Meilijson, I. 1989. A fast improvement to the EM algorithm on its own terms. Journal of the Royal Statistical Society B, 51(1):127–138.
Murray, A.T. and Estivill-Castro, V. 1998. Cluster discovery techniques for exploratory spatial data analysis. International Journal of Geographic Information Systems, 12(5):431–443.
Ng, R.T. and Han, J. 1994. Efficient and effective clustering methods for spatial data mining. In Proceedings of the 20th Conference on Very Large Data Bases (VLDB), J. Bocca, M. Jarke, and C. Zaniolo (Eds.), Santiago, Chile, San Francisco, CA: Morgan Kaufmann Publishers, pp. 144–155.
Oliver, J.J., Baxter, R.A., and Wallace, C.S. 1996. Unsupervised learning using MML. In Proceedings of the 13th Machine Learning Conference, L. Saitta (Ed.), San Mateo, CA, Morgan Kaufmann Publishers, pp. 364–372.
Overton, M.L. 1983. A quadratically convergent method for minimizing a sum of Euclidean norms. Mathematical Programming, 27:34–63.
Rogers, G.W., Wallet, B.C., and Wegman, E.J. 1997. A mixed measure formulation of the EM algorithm for huge data set applications. In Proceedings of the 28th Symposium on the Interface Between Computer Science and Statistics, L. Billard and N.I. Fisher (Eds.), Sydney, Australia, Interface Foundation of North America, pp. 492–497.
Rousseeuw, P.J. and Leroy, A.M. 1987. Robust Regression and Outlier Detection. NY, USA: John Wiley & Sons.
Selim, S.Z. and Ismail, M.A. 1984. k-means-type algorithms: A generalized convergence theorem and characterization of local optimality. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-6(1):81–86.
Smith, A.F.M. and Roberts, G.O. 1993. Bayesian computation via the Gibbs sampler and related Markov chain Monte Carlo methods. Journal of the Royal Statistical Society B, 55(1):2–23.
Tanner, M.A. 1993. Tools for Statistical Inference. NY, USA: Springer-Verlag.
Titterington, D.M., Smith, A.F.M., and Makov, U.E. 1985. Statistical Analysis of Finite Mixture Distributions. UK: John Wiley & Sons.
Wallace, C.S. and Freeman, P.R. 1987. Estimation and inference by compact coding. Journal of the Royal Statistical Society, Series B, 49(3):240–265.
Wesolowsky, G. 1993. The Weber problem: History and perspectives. Location Science, 1:5–23.
Zhang, B., Hsu, M., and Dayal, U. 2001. K-harmonic means—A spatial clustering algorithm with boosting. In Proceedings of the International Workshop on Temporal, Spatial and Spatio-Temporal Data Mining (TSDM 2000), held in conjunction with the 4th European Conference on Principles and Practices of Knowledge Discovery in Databases, Lyon, France, J. Roddick and K. Hornsby (Eds.), Springer-Verlag Lecture Notes in Artificial Intelligence, pp. 31–42.