Distributed statistical optimization for non-randomly stored big data with application to penalized learning
Abstract
Distributed optimization for big data has recently attracted enormous attention. However, the existing algorithms all rest on one critical randomness condition, namely that the data are randomly distributed across the different machines. This condition is seldom met in practice, and violating it can seriously degrade estimation accuracy. To address this problem, we propose an optimization framework based on a pilot-dataset surrogate loss function, which enables communication-efficient distributed optimization for non-randomly distributed big data. We further apply the framework to penalized high-dimensional sparse learning problems by combining it with penalty functions. Theoretical properties and numerical results both confirm the good performance of the proposed methods.
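For intuition, and only as an illustrative sketch rather than the paper's exact construction, a surrogate loss in the style of the communication-efficient CSL framework of Jordan, Lee, and Yang (2019), built here on a pilot sample instead of a single machine's local data, can be written as

\[
\widetilde{\mathcal{L}}(\theta)
\;=\;
\mathcal{L}_{\mathrm{pilot}}(\theta)
\;-\;
\big\langle \nabla\mathcal{L}_{\mathrm{pilot}}(\bar{\theta}) - \nabla\mathcal{L}_{N}(\bar{\theta}),\; \theta \big\rangle ,
\]

where \(\mathcal{L}_{\mathrm{pilot}}\) denotes the empirical loss on a small pilot sample drawn from the full data, \(\mathcal{L}_{N}\) the full-data loss, and \(\bar{\theta}\) an initial estimator; the notation \(\mathcal{L}_{\mathrm{pilot}}\) is ours, and the precise form used in the paper may differ. Minimizing \(\widetilde{\mathcal{L}}\) requires only the pilot data plus one round of aggregated gradients \(\nabla\mathcal{L}_{N}(\bar{\theta})\), which is what makes such schemes communication-efficient even when the data are not randomly partitioned; for the penalized extension, a sparsity-inducing term such as the LASSO penalty \(\lambda\lVert\theta\rVert_{1}\) would presumably be added to \(\widetilde{\mathcal{L}}\).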