Distributed statistical optimization for non-randomly stored big data with application to penalized learning

Statistics and Computing, Volume 33, Pages 1–13, 2023
Kangning Wang1, Shaomin Li2
1School of Statistics, Shandong Technology and Business University, Yantai, China
2School of Mathematics and Statistics, Beijing Jiaotong University, Beijing, China

Abstract

Distributed optimization for big data has recently attracted enormous attention. However, existing algorithms all rely on a critical randomness condition, namely that the data are randomly distributed across machines. This condition is seldom met in practice, and violating it can seriously degrade estimation accuracy. To address this problem, we propose an optimization framework based on a pilot-dataset surrogate loss function, which enables communication-efficient distributed optimization for non-randomly distributed big data. We further apply the framework to penalized high-dimensional sparse learning problems by combining it with penalty functions. Theoretical properties and numerical results both confirm the good performance of the proposed methods.
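To make the idea concrete, the following is a minimal sketch, not the authors' implementation: it assumes a least-squares loss and a communication-efficient surrogate loss of the Jordan–Lee–Yang (CSL) form, built on a small pilot dataset drawn uniformly across machines. All variable names, the pilot size `m`, and the choice of least squares are illustrative assumptions; the paper's framework covers general losses.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Simulated non-randomly stored data ------------------------------
# Observations are sorted by the first covariate before being split into
# K contiguous blocks, so each machine holds a systematically biased
# slice of the data (the randomness condition is violated).
n, p, K = 2000, 5, 10
X = rng.normal(size=(n, p))
beta_true = np.linspace(1.0, 2.0, p)
y = X @ beta_true + rng.normal(size=n)
idx = np.argsort(X[:, 0])                        # non-random storage
blocks = np.array_split(idx, K)                  # one block per machine

def grad(Xb, yb, beta):
    """Gradient of the average least-squares loss on one machine."""
    return -Xb.T @ (yb - Xb @ beta) / len(yb)

# --- Step 1: draw a small pilot dataset uniformly across machines ----
m = 20 * K                                       # pilot size (assumed choice)
pilot = np.concatenate(
    [rng.choice(b, size=m // K, replace=False) for b in blocks]
)
Xp, yp = X[pilot], y[pilot]

# Initial estimate from the pilot data alone (OLS)
beta0 = np.linalg.lstsq(Xp, yp, rcond=None)[0]

# --- Step 2: one communication round ---------------------------------
# Each machine sends only its local gradient at beta0 (a p-vector).
g_global = np.mean([grad(X[b], y[b], beta0) for b in blocks], axis=0)
g_pilot = grad(Xp, yp, beta0)

# --- Step 3: minimize the surrogate loss on the pilot data -----------
#   L~(beta) = L_pilot(beta) - (g_pilot - g_global)^T beta
# For least squares this minimizer has the closed form below.
correction = g_pilot - g_global
beta_hat = np.linalg.solve(Xp.T @ Xp / m, Xp.T @ yp / m + correction)

print(np.round(beta_hat, 3))
```

The global gradient term corrects the pilot loss so that the minimizer tracks the full-sample estimator rather than the pilot-only one, while each machine transmits only a p-dimensional vector per round. For the penalized extension described in the abstract, one would add a penalty (e.g. a lasso term) to the surrogate loss and solve the resulting penalized problem on the pilot machine.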
