Probabilistic k-Median Clustering in Data Streams

Theory of Computing Systems - Tập 56 - Trang 251-290 - 2014

Christiane Lammersen¹, Melanie Schmidt², Christian Sohler²

¹Simon Fraser University, Burnaby, Canada

²TU Dortmund University, Dortmund, Germany

Tóm tắt

The focus of our work is introducing and constructing probabilistic coresets. A probabilistic coreset can contain probabilistic points, and the number of these points should be polylogarithmic in the input size. However, the overall storage size is also influenced by representation size of the propability distribution of each point. So, our first observation is that the size of probabilistic coresets shall be restricted in the number of points and in the representation size of the points. We propose the first (k, ε)-coreset constructions for the probabilistic k-median problem in the metric and Euclidean case. The coresets are of size poly(ε −1, k, log(W/(p min⋅δ))), where W is the expected total weight of the weighted probabilistic input points when all weights are scaled to be at least one, p min is the probability of a point to be realized at a certain location, and δ is the error probability of the construction. Our coreset for the Euclidean problem can be maintained in data streams.

Tài liệu tham khảo

Agarwal, P.K., Har-Peled, S., Varadarajan, K.R.: Approximating extent measures of points. J. ACM 51(4), 606–635 (2004) Aggarwal, C.C., Yu, P.S.: A survey of uncertain data algorithms and applications. IEEE Trans. Knowl. Data Eng. 21(5), 609–623 (2009) Arora, S.: Polynomial time approximation schemes for Euclidean traveling salesman and other geometric problems. J. ACM 45(5), 753–782 (1998) Arora, S., Raghavan, P., Rao, S.: Approximation schemes for Euclidean k-medians and related problems. In: Proceedings 30th STOC, pp. 106–113 (1998) Arya, V., Garg, N., Khandekar, R., Meyerson, A., Munagala, K., Pandit, V.: Local search heuristics for k-median and facility location problems. SIAM J. Comput. 33(3), 544–562 (2004) Bȧdoiu, M., Har-Peled, S., Indyk, P.: Approximate clustering via core-sets. In: Proceedings 34th STOC, pp. 250–257 (2002) Bentley, J.L., Saxe, J.B.: Decomposable searching problems I: Static-to-dynamic transformation. J. Algorithm. 1(4), 301–358 (1980) Charikar, M., Guha, S.: Improved combinatorial algorithms for facility location problems. SIAM J. Comput 34(4), 803–824 (2005) Charikar, M., Guha, S., Tardos, É., Shmoys, D.B.: A constant-factor approximation algorithm for the k-median problem. J. Comput. Syst. Sci. 65(1), 129–149 (2002) Chau, M., Cheng, R., Kao, B., Ng, J.: Uncertain data mining: an example in clustering location data. In: Proceedings 10th PAKDD, pp. 199–204 (2006) Chen, K.: On coresets for k-median and k-means clustering in metric and euclidean spaces and their applications. SIAM J. Comput. 39(3), 923–947 (2009) Cormode, G., McGregor, A.: Approximation algorithms for clustering uncertain data. In: Proceedings 27th PODS, pp. 191–200 (2008) Edmonds, J., Karp, R.M.: Theoretical improvements in algorithmic efficiency for network flow problems. J. ACM 19(2), 248–264 (1972) Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings 2nd ACM SIGKDD, pp. 226–231 (1996) Feldman, D., Monemizadeh, M., Sohler, C.: A PTAS for k-means clustering based on weak coresets. In: Proceedings 23rd SoCG, pp. 11–18 (2007) Forgey, E.: Cluster analysis of multivariate data: Efficiency vs. interpretability of classification. Biometrics 768(21) (1965) Frahling, G., Sohler, C.: Coresets in dynamic geometric data streams. In: Proceedings 37th STOC, pp. 209–217 (2005) Guha, S., Meyerson, A., Mishra, N., Motwani, R., O’Callaghan, L.: Clustering data streams: theory and practice. IEEE Trans. Knowl. Data Eng. 15(3), 515–528 (2003) Guha, S., Munagala, K.: Exceeding expectations and clustering uncertain data. In: Proceedings 28th PODS, pp. 269–278 (2009) Günnemann, S., Kremer, H., Seidl, T.: Subspace clustering for uncertain data. In: Proceedings SIAM International Conference on Data Mining, pp. 385–396 (2010) Har-Peled, S., Kushal, A.: Smaller coresets for k-median and k-means clustering. Discret. Comput. Geom. 37(1), 3–19 (2007) Har-Peled, S., Mazumdar, S.: On coresets for k-means and k-median clustering. In: Proceedings 36th STOC, pp. 291–300 (2004) Haussler, D.: Decision theoretic generalizations of the pac model for neural net and other learning applications. Inf. Comput. 100(1), 78–150 (1992) Indyk, P.: Sublinear time algorithms for metric space problems. In: Proceedings 31st STOC, pp. 428–434 (1999) Jain, K., Mahdian, M., Saberi, A.: A new greedy approach for facility location problems. In: Proceedings 34th STOC, pp. 731–740 (2002) Jain, K., Vazirani, V.V.: Primal-dual approximation algorithms for metric facility location and k-median problems. In: Proceedings 40th FOCS, pp. 2–13 (1999) Kolliopoulos, S.G., Rao, S.: A nearly linear-time approximation scheme for the Euclidean k-median problem. SIAM J. Comput. 37(3), 757–782 (2007) Kriegel, H.P., Pfeifle, M.: Density-based clustering of uncertain data. In: Proceedings 11th ACM SIGKDD, pp. 672–677 (2005) Kriegel, H.P., Pfeifle, M.: Hierarchical density-based clustering of uncertain data. In: IEEE International Conference on Data Mining (ICDM), pp. 689–692 (2005) Kumar, A., Sabharwal, Y., Sen, S.: Linear-time approximation schemes for clustering problems in any dimensions. J. ACM 57(2) (2010) Lloyd, S.P.: Least squares quantization in pcm. IEEE Trans. Inf. Theory 28(2), 129–136 (1982) Mettu, R.R., Plaxton, C.G.: Optimal time bounds for approximate clustering. Mach. Learn. 56(1-3), 35–60 (2004) Ngai, W.K., Kao, B., Chui, C.K., Cheng, R., Chau, M., Yip, K.Y.: Efficient clustering of uncertain data. In: Proceedings 6th IEEE ICDM, pp. 436–445 (2006) Rubner, Y., Tomasi, C., Guibas, L.J.: A metric for distributions with applications to image databases. In: Proceedings 6th ICCV, pp. 59–66 (1998) Xu, H., Li, G.: Density-based probabilistic clustering of uncertain data. In: Proceedings 1st CSSE, vol. 4, pp. 474–477 (2008)

Scholar Hub - Công cụ hỗ trợ trích dẫn và phân tích khoa học Việt Nam

Về chúng tôi

Scholar Hub là công cụ hỗ trợ trích dẫn và phân tích các bài báo, công bố khoa học Việt Nam. Công cụ trợ giúp người nghiên cứu, tạp chí, đơn vị nghiên cứu tra cứu, phân tích và thống kê dữ liệu nghiên cứu khoa học tại Việt Nam và quốc tế.
ScholarHub KHÔNG đăng thông tin tổng hợp, KHÔNG đăng lại nội dung từ các trang báo chí Việt Nam hoặc trang thông tin điện tử khác tại Việt Nam.

Thông tin, cập nhật

Đăng ký Tạp chí tham gia vào Scholar Hub

Phản hồi ý kiến về Scholar Hub

Bài viết, nội dung cập nhật

Chủ đề khoa học

Website liên kết

Hệ thống CSDL Khoa học & Công nghệ

Phần mềm kiểm tra trùng lặp Kiểm Tra Tài Liệu

Phần mềm xuất bản tạp chí điện tử VOJS

Nền tảng trắc nghiệm và đề thi đa lĩnh vực LetQA