Distributed statistical optimization for non-randomly stored big data with application to penalized learning

Statistics and Computing, Volume 33, Pages 1–13, 2023
Kangning Wang1, Shaomin Li2
1School of Statistics, Shandong Technology and Business University, Yantai, China
2School of Mathematics and Statistics, Beijing Jiaotong University, Beijing, China

Abstract

Distributed optimization for big data has recently attracted enormous attention. However, existing algorithms all rely on a critical randomness condition, namely that the data are randomly distributed across machines. This condition is seldom met in practice, and violating it can seriously degrade estimation accuracy. To address this problem, we propose an optimization framework based on a pilot-dataset surrogate loss function, which enables communication-efficient distributed optimization for non-randomly distributed big data. We further apply the framework to penalized high-dimensional sparse learning problems by combining it with penalty functions. Theoretical properties and numerical results both confirm the good performance of the proposed methods.
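To make the idea concrete, the following is a minimal sketch, not the authors' implementation: it assumes a least-squares loss and a communication-efficient surrogate loss of the Jordan–Lee–Yang (CSL) form, built on a small pilot dataset drawn uniformly across machines. All variable names, the pilot size `m`, and the choice of least squares are illustrative assumptions; the paper's framework covers general losses.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Simulated non-randomly stored data ------------------------------
# Observations are sorted by the first covariate before being split into
# K contiguous blocks, so each machine holds a systematically biased
# slice of the data (the randomness condition is violated).
n, p, K = 2000, 5, 10
X = rng.normal(size=(n, p))
beta_true = np.linspace(1.0, 2.0, p)
y = X @ beta_true + rng.normal(size=n)
idx = np.argsort(X[:, 0])                        # non-random storage
blocks = np.array_split(idx, K)                  # one block per machine

def grad(Xb, yb, beta):
    """Gradient of the average least-squares loss on one machine."""
    return -Xb.T @ (yb - Xb @ beta) / len(yb)

# --- Step 1: draw a small pilot dataset uniformly across machines ----
m = 20 * K                                       # pilot size (assumed choice)
pilot = np.concatenate(
    [rng.choice(b, size=m // K, replace=False) for b in blocks]
)
Xp, yp = X[pilot], y[pilot]

# Initial estimate from the pilot data alone (OLS)
beta0 = np.linalg.lstsq(Xp, yp, rcond=None)[0]

# --- Step 2: one communication round ---------------------------------
# Each machine sends only its local gradient at beta0 (a p-vector).
g_global = np.mean([grad(X[b], y[b], beta0) for b in blocks], axis=0)
g_pilot = grad(Xp, yp, beta0)

# --- Step 3: minimize the surrogate loss on the pilot data -----------
#   L~(beta) = L_pilot(beta) - (g_pilot - g_global)^T beta
# For least squares this minimizer has the closed form below.
correction = g_pilot - g_global
beta_hat = np.linalg.solve(Xp.T @ Xp / m, Xp.T @ yp / m + correction)

print(np.round(beta_hat, 3))
```

The global gradient term corrects the pilot loss so that the minimizer tracks the full-sample estimator rather than the pilot-only one, while each machine transmits only a p-dimensional vector per round. For the penalized extension described in the abstract, one would add a penalty (e.g. a lasso term) to the surrogate loss and solve the resulting penalized problem on the pilot machine.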
