One-Pass Additive-Error Subset Selection for \(\ell_p\) Subspace Approximation and (k, p)-Clustering
Abstract
We consider the problem of subset selection for \(\ell_p\) subspace approximation and (k, p)-clustering. Our aim is to efficiently find a small subset of data points such that solving the problem optimally on this subset gives a good approximation to solving it optimally on the original input. For \(\ell_p\) subspace approximation, the previously known subset selection algorithms, based on the volume sampling and adaptive sampling of Deshpande and Varadarajan (STOC’07, 2007), handle the general case of \(p \in [1, \infty)\) but require multiple passes over the data. In this paper, we give a one-pass subset selection algorithm with an additive approximation guarantee for \(\ell_p\) subspace approximation, for any \(p \in [1, \infty)\). Earlier subset selection algorithms that give a one-pass multiplicative \((1+\epsilon)\) approximation work only in special cases: Cohen et al. (SODA’17, 2017) give a one-pass subset selection with a multiplicative \((1+\epsilon)\) approximation guarantee for the special case of \(\ell_2\) subspace approximation, and Mahabadi et al. (STOC’20, 2020) give a one-pass noisy subset selection with a \((1+\epsilon)\) approximation guarantee for \(\ell_p\) subspace approximation when \(p \in \{1, 2\}\). Our subset selection algorithm gives a weaker, additive approximation guarantee, but it works for any \(p \in [1, \infty)\).
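For reference, the \(\ell_p\) subspace approximation objective can be written as follows; this is the standard formulation, with the notation chosen here rather than taken from the paper:

```latex
% Given points a_1, ..., a_n in R^d and a target dimension k, find a
% k-dimensional linear subspace V of R^d minimizing the sum of p-th
% powers of the Euclidean distances to V:
\operatorname{err}_p(V) \;=\; \sum_{i=1}^{n} d(a_i, V)^{p},
\qquad\text{where}\quad
d(a_i, V) \;=\; \min_{x \in V}\, \lVert a_i - x \rVert_2 .
```

For \(p = 2\) this is the classical rank-k approximation problem, solvable via the singular value decomposition; for other values of p no such closed-form solution is known, which is one motivation for subset selection.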
We also consider (k, p)-clustering, where the task is to group the data points into k clusters such that the sum of the p-th powers of the distances from the points to their cluster centers is minimized, for \(p \in [1, \infty)\). The known subset selection algorithms for this problem are based on \(D^p\) sampling, due to Wei (NIPS’16, 2016), which extends the \(D^2\) sampling proposed by Arthur and Vassilvitskii (SODA’07, 2007).
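The following minimal NumPy sketch makes the sampling scheme concrete (the function name and interface are ours, not the paper's); for \(p = 2\) it is exactly k-means++ seeding:

```python
import numpy as np

def dp_sampling(X, k, p, seed=None):
    """D^p sampling: choose the first center uniformly at random, then
    choose each subsequent center with probability proportional to
    d(x, C)^p, the p-th power of the distance from x to its nearest
    already-chosen center.  The centers C seed a solution to the
    (k, p)-clustering cost  sum_x min_{c in C} ||x - c||^p."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    idx = [rng.integers(n)]
    # d_min[i] = distance from X[i] to its nearest chosen center so far
    d_min = np.linalg.norm(X - X[idx[0]], axis=1)
    for _ in range(k - 1):
        w = d_min ** p
        idx.append(rng.choice(n, p=w / w.sum()))
        d_min = np.minimum(d_min, np.linalg.norm(X - X[idx[-1]], axis=1))
    return X[idx]
```

Each of the k − 1 rounds needs the distance from every point to the centers chosen so far, so a naive streaming implementation takes one pass per sampled center.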
Due to this inherently adaptive nature of \(D^p\) sampling, these algorithms require multiple passes over the input. In this work, we give a one-pass subset selection for (k, p)-clustering that achieves a constant factor approximation with respect to the optimal solution, up to an additive error term. Bachem et al. (NIPS’16, 2016) also give a one-pass subset selection for k-means, i.e., the case \(p = 2\); our result solves the more general problem for \(p \in [1, \infty)\). At the core, our contribution lies in a one-pass MCMC-based subset selection algorithm whose cost on the sampled points closely approximates the corresponding optimal cost, with high probability.
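To illustrate the general flavour of MCMC-based seeding, the sketch below approximates each \(D^p\) sample with a short Metropolis chain run over a uniform subsample S of the input, in the spirit of the K-MC\(^2\) sampler of Bachem et al. generalized from \(p = 2\) to arbitrary p. It is an illustrative sketch under these assumptions, not the algorithm analysed in the paper, and all names and parameters are ours:

```python
import numpy as np

def nearest_dist_p(x, centers, p):
    """p-th power of the distance from x to its nearest current center."""
    return min(np.linalg.norm(x - c) for c in centers) ** p

def mcmc_dp_seeding(S, k, p, chain_len=200, seed=None):
    """Approximate D^p seeding via Metropolis chains.  S is a uniform
    subsample of the input, collectable in a single pass by reservoir
    sampling, so no further passes over the stream are needed.
    Illustrative sketch only; not the paper's algorithm."""
    rng = np.random.default_rng(seed)
    m = S.shape[0]
    centers = [S[rng.integers(m)]]              # first center: uniform
    for _ in range(k - 1):
        x = rng.integers(m)                     # chain state: index into S
        dx = nearest_dist_p(S[x], centers, p)
        for _ in range(chain_len):
            y = rng.integers(m)                 # uniform proposal
            dy = nearest_dist_p(S[y], centers, p)
            # Metropolis step: accept with probability min(1, dy / dx),
            # so the chain's stationary law is D^p restricted to S.
            if rng.random() * max(dx, 1e-300) < dy:
                x, dx = y, dy
        centers.append(S[x])
    return np.array(centers)
```

Longer chains bring each sample closer to the exact \(D^p\) distribution; the price of the single pass is that the sampling, and hence the resulting guarantee, is only approximate.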
References
Feldman, D.: Introduction to Core-sets: an Updated Survey (2020)
Frieze, A.M., Kannan, R., Vempala, S.S.: Fast Monte-Carlo algorithms for finding low-rank approximations. J. ACM 51(6), 1025–1041 (2004). https://doi.org/10.1145/1039488.1039494
Wei, K., Iyer, R., Bilmes, J.: Submodularity in data subset selection and active learning. In: Proceedings of the 32nd International Conference on Machine Learning, vol. 37, pp. 1954–1963 (2015)
Boutsidis, C., Mahoney, M.W., Drineas, P.: An improved approximation algorithm for the column subset selection problem. In: Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 968–977 (2009). SIAM
Drineas, P., Mahoney, M.W., Muthukrishnan, S.: Relative-error CUR matrix decompositions. SIAM J. Matrix Anal. Appl. 30(2), 844–881 (2008). https://doi.org/10.1137/07070471X
Mahoney, M.W., Drineas, P.: CUR matrix decompositions for improved data analysis. Proc. Natl. Acad. Sci. USA 106(3), 697–702 (2009). https://doi.org/10.1073/pnas.0803205106
Wang, S., Zhang, Z.: Improving CUR matrix decomposition and the Nyström approximation via adaptive sampling. J. Mach. Learn. Res. 14(1), 2729–2769 (2013)
Mahoney, M.W.: Randomized algorithms for matrices and data. Found. Trends Mach. Learn. 3(2), 123–224 (2011). https://doi.org/10.1561/2200000035
Ida, Y., Kanai, S., Fujiwara, Y., Iwata, T., Takeuchi, K., Kashima, H.: Fast deterministic CUR matrix decomposition with accuracy assurance. In: International Conference on Machine Learning, pp. 4594–4603 (2020). PMLR
Sun, J., Xie, Y., Zhang, H., Faloutsos, C.: Less is more: Compact matrix decomposition for large sparse graphs. In: Proceedings of the 2007 SIAM International Conference on Data Mining, pp. 366–377 (2007). SIAM
Mahoney, M.W., Maggioni, M., Drineas, P.: Tensor-CUR decompositions for tensor-based data. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 327–336 (2006)
Drineas, P., Kerenidis, I., Raghavan, P.: Competitive recommendation systems. In: Proceedings of the Thirty-fourth Annual ACM Symposium on Theory of Computing, pp. 82–90 (2002)
Deshpande, A., Vempala, S.: Adaptive sampling and fast low-rank matrix approximation. In: Proceedings of the 9th International Conference on Approximation Algorithms for Combinatorial Optimization Problems, and 10th International Conference on Randomization and Computation. APPROX’06/RANDOM’06, pp. 292–303. Springer, Berlin, Heidelberg (2006). https://doi.org/10.1007/11830924_28
Deshpande, A., Rademacher, L., Vempala, S.S., Wang, G.: Matrix approximation and projective clustering via volume sampling. Theory Comput. 2(12), 225–247 (2006). https://doi.org/10.4086/toc.2006.v002a012
Guruswami, V., Sinop, A.K.: Optimal column-based low-rank matrix reconstruction. In: Proceedings of the Twenty-Third Annual ACM-SIAM Symposium on Discrete Algorithms. SODA ’12, pp. 1207–1214. Society for Industrial and Applied Mathematics, USA (2012)
Derezinski, M., Warmuth, M.K.: Unbiased estimates for linear regression via volume sampling. Advances in Neural Information Processing Systems 30 (2017)
Broadbent, M.E., Brown, M., Penner, K., Ipsen, I., Rehman, R.: Subset selection algorithms: Randomized vs. deterministic. SIAM Undergraduate Research Online 3(01) (2010)
Dan, C., Wang, H., Zhang, H., Zhou, Y., Ravikumar, P.K.: Optimal analysis of subset-selection based \(\ell _p\) low-rank approximation. Advances in Neural Information Processing Systems 32 (2019)
Lloyd, S.: Least squares quantization in PCM. IEEE Trans. Inf. Theory 28(2), 129–137 (1982). https://doi.org/10.1109/TIT.1982.1056489
Arthur, D., Vassilvitskii, S.: k-means++: The advantages of careful seeding. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms. SODA ’07, pp. 1027–1035. Society for Industrial and Applied Mathematics, USA (2007)
Bahmani, B., Moseley, B., Vattani, A., Kumar, R., Vassilvitskii, S.: Scalable k-means++. Proc. VLDB Endow. 5(7), 622–633 (2012). https://doi.org/10.14778/2180912.2180915
Bachem, O., Lucic, M., Hassani, S.H., Krause, A.: Approximate k-means++ in sublinear time. Proceedings of the AAAI Conference on Artificial Intelligence 30(1) (2016). https://doi.org/10.1609/aaai.v30i1.10259
Bachem, O., Lucic, M., Hassani, S.H., Krause, A.: Fast and provably good seedings for k-means. In: Proceedings of the 30th International Conference on Neural Information Processing Systems. NIPS’16, pp. 55–63. Curran Associates Inc., Red Hook (2016)
Deshpande, A., Varadarajan, K.: Sampling-based dimension reduction for subspace approximation. In: Proceedings of the Thirty-Ninth Annual ACM Symposium on Theory of Computing. STOC ’07, pp. 641–650. Association for Computing Machinery, New York (2007). https://doi.org/10.1145/1250790.1250884
Aloise, D., Deshpande, A., Hansen, P., Popat, P.: NP-hardness of Euclidean sum-of-squares clustering. Mach. Learn. 75(2), 245–248 (2009). https://doi.org/10.1007/s10994-009-5103-0
Mahajan, M., Nimbhorkar, P., Varadarajan, K.R.: The planar k-means problem is NP-hard. Theor. Comput. Sci. 442, 13–21 (2012). https://doi.org/10.1016/j.tcs.2010.05.034
Aggarwal, A., Deshpande, A., Kannan, R.: Adaptive sampling for k-means clustering. In: Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2009), pp. 15–28. Springer (2009). https://doi.org/10.1007/978-3-642-03685-9_2
Wei, D.: A constant-factor bi-criteria approximation guarantee for k-means++. In: Proceedings of the 30th International Conference on Neural Information Processing Systems. NIPS’16, pp. 604–612. Curran Associates Inc., Red Hook (2016)
Jaiswal, R., Kumar, M., Yadav, P.: Improved analysis of \(D^2\)-sampling based PTAS for k-means and other clustering problems. Inf. Process. Lett. 115(2), 100–103 (2015). https://doi.org/10.1016/j.ipl.2014.07.009
Kumar, A., Sabharwal, Y., Sen, S.: A simple linear time \((1+\epsilon )\)-approximation algorithm for \(k\)-means clustering in any dimensions. In: 45th Annual IEEE Symposium on Foundations of Computer Science, pp. 454–462 (2004). IEEE
Cohen, M.B., Musco, C., Musco, C.: Input sparsity time low-rank approximation via ridge leverage score sampling. In: Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms. SODA ’17, pp. 1758–1777. Society for Industrial and Applied Mathematics, USA (2017)
Mahabadi, S., Razenshteyn, I., Woodruff, D.P., Zhou, S.: Non-adaptive adaptive sampling on turnstile streams. In: Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing. STOC 2020, pp. 1251–1264. Association for Computing Machinery, New York (2020). https://doi.org/10.1145/3357713.3384331
Vitter, J.S.: Random sampling with a reservoir. ACM Trans. Math. Softw. 11(1), 37–57 (1985). https://doi.org/10.1145/3147.3165
Efraimidis, P.S., Spirakis, P.P.: Weighted random sampling. In: Encyclopedia of Algorithms, pp. 2365–2367 (2016). https://doi.org/10.1007/978-1-4939-2864-4_478
Ghashami, M., Liberty, E., Phillips, J.M., Woodruff, D.P.: Frequent directions: Simple and deterministic matrix sketching. CoRR abs/1501.01711 (2015). arXiv:1501.01711
Braverman, V., Drineas, P., Musco, C., Musco, C., Upadhyay, J., Woodruff, D.P., Zhou, S.: Near optimal linear algebra in the online and sliding window models. In: 2020 IEEE 61st Annual Symposium on Foundations of Computer Science (FOCS), pp. 517–528 (2020). IEEE
Ban, F., Bhattiprolu, V., Bringmann, K., Kolev, P., Lee, E., Woodruff, D.P.: A PTAS for \(\ell_p\)-low rank approximation. In: Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 747–766 (2019). SIAM
Chierichetti, F., Gollapudi, S., Kumar, R., Lattanzi, S., Panigrahy, R., Woodruff, D.P.: Algorithms for \(\ell _p\) low-rank approximation. In: International Conference on Machine Learning, pp. 806–814 (2017). PMLR
Liberty, E.: Simple and deterministic matrix sketching. In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD ’13, pp. 581–588. Association for Computing Machinery, New York, NY, USA (2013). https://doi.org/10.1145/2487575.2487623
Ghashami, M., Phillips, J.M.: Relative errors for deterministic low-rank matrix approximations. In: Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms. SODA ’14, pp. 707–717. Society for Industrial and Applied Mathematics, USA (2014)
Ghashami, M., Liberty, E., Phillips, J.M., Woodruff, D.P.: Frequent directions: Simple and deterministic matrix sketching. SIAM J. Comput. 45(5), 1762–1792 (2016). https://doi.org/10.1137/15M1009718
Cormode, G., Dickens, C., Woodruff, D.: Leveraging well-conditioned bases: Streaming and distributed summaries in Minkowski p-norms. In: International Conference on Machine Learning, pp. 1048–1056 (2018)
Clarkson, K.L., Woodruff, D.P.: Low-rank approximation and regression in input sparsity time. J. ACM 63(6) (2017). https://doi.org/10.1145/3019134
Feldman, D., Monemizadeh, M., Sohler, C., Woodruff, D.P.: Coresets and sketches for high dimensional subspace approximation problems. In: Proceedings of the Twenty-first Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 630–649 (2010). SIAM
Ailon, N., Jaiswal, R., Monteleoni, C.: Streaming k-means approximation. Advances in Neural Information Processing Systems 22 (2009)
Jaiswal, R., Kumar, A., Sen, S.: A simple \(D^2\)-sampling based PTAS for k-means and other clustering problems. Algorithmica 70(1), 22–46 (2014). https://doi.org/10.1007/s00453-013-9833-9
Anari, N., Gharan, S.O., Rezaei, A.: Monte Carlo Markov chain algorithms for sampling strongly Rayleigh distributions and determinantal point processes. In: Conference on Learning Theory, pp. 103–115 (2016). PMLR
Chao, M.T.: A general purpose unequal probability sampling plan. Biometrika 69(3), 653–656 (1982). https://doi.org/10.1093/biomet/69.3.653
Cai, H.: Exact bound for the convergence of Metropolis chains. Stochastic Anal. Appl. 18(1), 63–71 (2000). https://doi.org/10.1080/07362990008809654