Stochastic gradient descent without full data shuffle: with applications to in-database machine learning and deep learning systems

Lijie Xu1, Shuang Qiu2, Binhang Yuan2, Jiawei Jiang3, Cédric Renggli1, Shaoduo Gan1, Kaan Kara1, Guoliang Li4, Ji Liu5, Wentao Wu6, Jieping Ye7, Ce Zhang8
1ETH Zürich, Zürich, Switzerland
2Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong
3Wuhan University, Wuhan, China
4Tsinghua University, Beijing, China
5Meta, Menlo Park, USA
6Microsoft Research, Washington, USA
7University of Michigan, Ann Arbor, USA
8University of Chicago, Chicago, USA

Abstract

Modern machine learning (ML) systems commonly use stochastic gradient descent (SGD) to train ML models. However, SGD relies on random data order to converge, which usually requires a full data shuffle. For in-DB ML systems and deep learning systems with large datasets stored on block-addressable secondary storage such as HDD and SSD, this full data shuffle leads to low I/O performance: the data shuffling time can be even longer than the training itself, due to massive random data accesses. To balance the convergence rate of SGD (which favors data randomness) and its I/O performance (which favors sequential access), previous work has proposed several data shuffling strategies. In this paper, we first perform an empirical study on existing data shuffling strategies, showing that these strategies suffer from either low performance or a low convergence rate. To solve this problem, we propose a simple but novel two-level data shuffling strategy named CorgiPile, which avoids a full data shuffle while maintaining a convergence rate comparable to that of SGD with a full shuffle. We further theoretically analyze the convergence behavior of CorgiPile and empirically evaluate its efficacy in both in-DB ML and deep learning systems. For in-DB ML systems, we integrate CorgiPile into PostgreSQL by introducing three new physical operators with optimizations. For deep learning systems, we extend single-process CorgiPile to multi-process CorgiPile for the parallel/distributed environment and integrate it into PyTorch. Our evaluation shows that CorgiPile achieves a convergence rate comparable to that of full-shuffle-based SGD for both linear models and deep learning models. For in-DB ML with linear models, CorgiPile is 1.6×–12.8× faster than two state-of-the-art systems, Apache MADlib and Bismarck, on both HDD and SSD. For deep learning models on ImageNet, CorgiPile is 1.5× faster than PyTorch with full data shuffle.
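To make the "two-level" idea concrete, the following is a minimal, illustrative Python sketch of a block-then-tuple shuffle: block reads stay sequential while the block order and the tuples inside a bounded in-memory buffer are randomized. The function name two_level_shuffle, the buffer_blocks parameter, and the list-of-blocks representation are assumptions made here for illustration; they are not the paper's actual physical operators or its PyTorch integration.

    import random
    from typing import Iterator, List, Sequence, TypeVar

    T = TypeVar("T")

    def two_level_shuffle(
        blocks: Sequence[Sequence[T]],
        buffer_blocks: int = 2,
        seed: int = 0,
    ) -> Iterator[T]:
        """Yield tuples in a two-level (block-then-tuple) shuffled order.

        Level 1: permute the order in which blocks are read, so I/O stays
        sequential within each block.
        Level 2: collect `buffer_blocks` blocks into an in-memory buffer and
        shuffle the tuples inside it before yielding them.
        """
        rng = random.Random(seed)
        block_order = list(range(len(blocks)))
        rng.shuffle(block_order)                    # level 1: block-level shuffle

        buffer: List[T] = []
        for i, b in enumerate(block_order, start=1):
            buffer.extend(blocks[b])                # sequential read of one block
            if i % buffer_blocks == 0:
                rng.shuffle(buffer)                 # level 2: tuple-level shuffle
                yield from buffer
                buffer.clear()
        if buffer:                                  # flush the last partial buffer
            rng.shuffle(buffer)
            yield from buffer

    if __name__ == "__main__":
        # Toy example: 4 blocks of 4 tuples each, buffer holding 2 blocks.
        data_blocks = [[i * 4 + j for j in range(4)] for i in range(4)]
        print(list(two_level_shuffle(data_blocks, buffer_blocks=2, seed=42)))

In practice the blocks would correspond to on-disk pages or file chunks that are read sequentially, which is what keeps the I/O cost close to a sequential scan rather than the massive random accesses incurred by a full tuple-level shuffle.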

References

Buffer Manager of PostgreSQL. https://www.interdb.jp/pg/pgsql08.html (2022)

Abadi, M., Barham, P., Chen, J., et al.: Tensorflow: A system for large-scale machine learning. In: 12th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2016, pp. 265–283. USENIX Association (2016)

Paszke, A., Gross, S., Massa, F., et al.: Pytorch: an imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, pp. 8024–8035 (2019)

Hellerstein, J.M., Ré, C., Schoppmann, F., Wang, D.Z., Fratkin, E., Gorajek, A., Ng, K.S., Welton, C., Feng, X., Li, K., Kumar, A.: The madlib analytics library or MAD skills, the SQL. Proc. VLDB Endow. 5(12), 1700–1711 (2012)

Feng, X., Kumar, A., Recht, B., Ré, C.: Towards a unified architecture for in-rdbms analytics. In: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, SIGMOD ’12, pp. 325–336 (2012)

Schleich, M., Olteanu, D., Ciucanu, R.: Learning linear regression models over factorized joins. In: Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, pp. 3–18. ACM (2016)

Kumar, A., Naughton, J.F., Patel, J.M.: Learning generalized linear models over normalized data. In: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pp. 1969–1984. ACM (2015)

Olteanu, D., Schleich, M.: F: regression models over factorized views. Proc. VLDB Endow. 9(13), 1573–1576 (2016)

Chen, L., Kumar, A., Naughton, J.F., Patel, J.M.: Towards linear algebra over normalized data. Proc. VLDB Endow. 10(11), 1214–1225 (2017)

Rendle, S.: Scaling factorization machines to relational data. Proc. VLDB Endow. 6(5), 337–348 (2013)

Khamis, M.A., Ngo, H.Q., Nguyen, X., Olteanu, D., Schleich, M.: In-database learning with sparse tensors. In: Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pp. 325–340. ACM (2018)

Luo, S., Gao, Z.J., Gubanov, M.N., Perez, L.L., Jankov, D., Jermaine, C.M.: Scalable linear algebra on a relational database system. Commun. ACM 63(8), 93–101 (2020)

Apache MADlib: big data machine learning in SQL. http://madlib.apache.org/

Zhang, C., Ré, C.: Dimmwitted: a study of main-memory statistical analytics. Proc. VLDB Endow. 7(12) (2014)

Kara, K., Eguro, K., Zhang, C., Alonso, G.: Columnml: column-store machine learning with on-the-fly data transformation. Proc. VLDB Endow. 12(4), 348–361 (2018)

Petersen, T.K.: Inside the lustre file system. SEAGATE Technology paper (2015)

Sliding-Window Shuffle in TensorFlow. https://www.tensorflow.org/api_docs/python/tf/data/Dataset

PostgreSQL. https://www.postgresql.org/

Graefe, G.: Volcano—an extensible and parallel query evaluation system. IEEE Trans. Knowl. Data Eng. 6(1), 120–135 (1994)

Xu, L., Qiu, S., Yuan, B., Jiang, J., Renggli, C., Gan, S., Kara, K., Li, G., Liu, J., Wu, W., Ye, J., Zhang, C.: In-database machine learning with corgipile: stochastic gradient descent without full data shuffle. In: SIGMOD ’22: International Conference on Management of Data, Philadelphia, PA, USA, June 12–17, 2022, pp. 1286–1300. ACM (2022)

ImageNet Dataset. https://www.image-net.org/

Bottou, L.: Stochastic gradient descent tricks. In: Neural Networks: Tricks of the Trade, pp. 421–436. Springer (2012)

De Sa, C.M.: Random reshuffling is not always better. Adv. Neural Inf. Process. Syst. 33, 5957–5967 (2020)

Yun, C., Sra, S., Jadbabaie, A.: Open problem: Can single-shuffle SGD be better than reshuffling SGD and gd? In: Conference on Learning Theory, COLT 2021, Proceedings of Machine Learning Research, vol. 134, pp. 4653–4658. PMLR (2021)

Jankov, D., Yuan, B., Luo, S., Jermaine, C.: Distributed numerical and machine learning computations via two-phase execution of aggregated join trees. Proc. VLDB Endow. 14(7), 1228–1240 (2021)

Luo, S., Jankov, D., Yuan, B., Jermaine, C.: Automatic optimization of matrix implementations for distributed machine learning and linear algebra. In: Proceedings of the 2021 International Conference on Management of Data, pp. 1222–1234 (2021)

Yuan, B., Jankov, D., Zou, J., Tang, Y., Bourgeois, D., Jermaine, C.: Tensor relational algebra for distributed machine learning system design. Proc. VLDB Endow. 14(8), 1338–1350 (2021)

Jasny, M., Ziegler, T., Kraska, T., Röhm, U., Binnig, C.: DB4ML—an in-memory database kernel with machine learning support. In: Proceedings of the 2020 International Conference on Management of Data, SIGMOD Conference 2020, pp. 159–173. ACM (2020)

Ruder, S.: An overview of gradient descent optimization algorithms. CoRR abs/1609.04747 (2016)

The optimization algorithms in PyTorch. https://pytorch.org/docs/stable/optim.html

The optimization algorithms in TensorFlow. https://www.tensorflow.org/api_docs/python/tf/keras/optimizers

Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: 3rd International Conference on Learning Representations, ICLR 2015 (2015)

He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, pp. 770–778 (2016)

LIBSVM Data. https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/

CIFAR-10 Dataset. http://www.cs.toronto.edu/~kriz/cifar.html

Arpaci-Dusseau, R.H., Arpaci-Dusseau, A.C.: Operating Systems: Three Easy Pieces, 1.00 edn. Arpaci-Dusseau Books (2018)

PyTorch’s shuffling strategy. https://pytorch.org/data/main/generated/torchdata.datapipes.iter.Shuffler.html (2024)

Bottou, L.: Curiously fast convergence of some stochastic gradient descent algorithms. In: Proceedings of the Symposium on Learning and Data Science, Paris, vol. 8, pp. 2624–2633 (2009)

Gürbüzbalaban, M., Ozdaglar, A., Parrilo, P.A.: Why random reshuffling beats stochastic gradient descent. Math. Program. pp. 1–36 (2019)

HaoChen, J.Z., Sra, S.: Random shuffling beats SGD after finite epochs. In: Proceedings of the 36th International Conference on Machine Learning, ICML 2019, Proceedings of Machine Learning Research, vol. 97, pp. 2624–2633. PMLR (2019)

Shamir, O.: Without-replacement sampling for stochastic gradient methods. In: Advances in Neural Information Processing Systems, pp. 46–54 (2016)

Gürbüzbalaban, M., Ozdaglar, A.E., Parrilo, P.A.: Why random reshuffling beats stochastic gradient descent. Math. Program. 186(1), 49–84 (2021)

Ying, B., Yuan, K., Vlaski, S., Sayed, A.H.: Stochastic learning under random reshuffling with constant step-sizes. IEEE Trans. Signal Process. 67(2), 474–489 (2019)

Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning. SIAM Rev. 60(2), 223–311 (2018)

Liu, J., Zhang, C.: Distributed learning systems with first-order methods. Found. Trends Databases 9(1), 1–100 (2020)

Xu, L., Qiu, S., Yuan, B., Jiang, J., Renggli, C., Gan, S., Kara, K., Li, G., Liu, J., Wu, W., Ye, J., Zhang, C.: Stochastic gradient descent without full data shuffle. CoRR abs/2206.05830 (2022). https://doi.org/10.48550/arXiv.2206.05830

Getting Started With Distributed Data Parallel. https://pytorch.org/tutorials/intermediate/ddp_tutorial.html

Hadoop HDFS Architecture. https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html

Amazon Elastic Block Store (EBS). https://aws.amazon.com/ebs

The Lustre file system. https://www.lustre.org/

ETH Euler Cluster. https://scicomp.ethz.ch/wiki/Euler

Lustre reads/writes data in blocks. https://scicomp.ethz.ch/wiki/Conda

The TFRecord format for storing a sequence of binary records. https://www.tensorflow.org/tutorials/load_data/tfrecord

TFRecord format for PyTorch. https://github.com/vahidk/tfrecord

MADlib-Deep Learning. https://madlib.apache.org/docs/latest/group__grp__keras.html (2024)

Thomee, B., Shamma, D.A., Friedland, G., Elizalde, B., Ni, K., Poland, D., Borth, D., Li, L.: YFCC100M: the new data in multimedia research. Commun. ACM 59(2), 64–73 (2016)

ImageNet training in PyTorch. https://github.com/pytorch/examples/tree/main/imagenet

PostgreSQL TOAST. https://www.postgresql.org/docs/9.5/storage-toast.html

Epsilon Dataset. https://www.k4all.org/project/large-scale-learning-challenge/

Bandwidth Bench. https://github.com/JerryLead/bandwidth-bench (2024)

Best practices on Lustre parallel file systems. https://scicomp.ethz.ch/wiki/Best_practices_on_Lustre_parallel_file_systems

Bottou, L.: Large-scale machine learning with stochastic gradient descent. In: 19th International Conference on Computational Statistics, COMPSTAT 2010, pp. 177–186 (2010)

Moulines, E., Bach, F.R.: Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In: Advances in Neural Information Processing Systems, pp. 451–459 (2011)

Ghadimi, S., Lan, G.: Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM J. Optim. 23(4), 2341–2368 (2013)

Polyak, B.T.: Gradient methods for minimizing functionals. Zhurnal Vychislitel’noi Matematiki i Matematicheskoi Fiziki 3(4), 643–653 (1963)

Haas, P.J., Koenig, C.: A bi-level Bernoulli scheme for database sampling. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, 2004, pp. 275–286. ACM (2004)

Cheng, Y., Zhao, W., Rusu, F.: Bi-level online aggregation on raw data. In: Proceedings of the 29th International Conference on Scientific and Statistical Database Management, Chicago, IL, USA, June 27–29, 2017, pp. 10:1–10:12. ACM (2017)

MacGregor, J.: Predictive Analysis with SAP. Galileo Press, Bonn (2013)

Oracle R Enterprise Versions of R Models. https://docs.oracle.com/cd/E11882_01/doc.112/e36761/orelm.htm

The Greenplum MADlib extension. https://greenplum.docs.pivotal.io/6-19/analytics/madlib.html

Fard, A., Le, A., Larionov, G., Dhillon, W., Bear, C.: Vertica-ml: distributed machine learning in vertica database. In: Proceedings of the 2020 International Conference on Management of Data, SIGMOD Conference 2020, pp. 755–768. ACM (2020)

Google BigQuery ML. https://cloud.google.com/bigquery-ml/docs/introduction

Microsoft SQL Server Machine Learning Services. https://docs.microsoft.com/en-us/sql/machine-learning/sql-server-machine-learning-services?view=sql-server-ver15

Chen, T., He, T., Benesty, M., Khotilovich, V., Tang, Y., Cho, H., et al.: Xgboost: extreme gradient boosting. R Package Version 0.4–2 1(4), 1–4 (2015)

Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., Liu, T.Y.: Lightgbm: a highly efficient gradient boosting decision tree. Adv. Neural Inf. Process. Syst. 30, 3146–3154 (2017)

Meng, X., Bradley, J.K., Yavuz, B., et al.: Mllib: machine learning in apache spark. J. Mach. Learn. Res. 17, 34:1-34:7 (2016)

Zhang, Z., Jiang, J., Wu, W., Zhang, C., Yu, L., Cui, B.: Mllib*: fast training of glms using spark mllib. In: 35th IEEE International Conference on Data Engineering, ICDE 2019, pp. 1778–1789. IEEE (2019)

Cai, Z., Vagena, Z., Perez, L.L., Arumugam, S., Haas, P.J., Jermaine, C.M.: Simulation of database-valued Markov chains using simsql. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2013, pp. 637–648. ACM (2013)

Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., Ranzato, M., Senior, A., Tucker, P., Yang, K., et al.: Large scale distributed deep networks. Adv. Neural Inf. Process. Syst. 25, 1223–1231 (2012)

Xing, E.P., Ho, Q., Dai, W., Kim, J.K., Wei, J., Lee, S., Zheng, X., Xie, P., Kumar, A., Yu, Y.: Petuum: a new platform for distributed machine learning on big data. IEEE Trans. Big Data 1(2), 49–67 (2015)

Jiang, J., Cui, B., Zhang, C., Yu, L.: Heterogeneity-aware distributed parameter servers. In: Proceedings of the 2017 ACM International Conference on Management of Data, pp. 463–478 (2017)

Lian, X., Zhang, C., Zhang, H., Hsieh, C.J., Zhang, W., Liu, J.: Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. In: Proceedings of the 31st International Conference on Neural Information Processing Systems, pp. 5336–5346 (2017)

Tang, H., Lian, X., Yan, M., Zhang, C., Liu, J.: D²: Decentralized training over decentralized data. In: Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Proceedings of Machine Learning Research, vol. 80, pp. 4855–4863. PMLR (2018)

Zhang, Y., Mcquillan, F., Jayaram, N., Kak, N., Khanna, E., Kislal, O., Valdano, D., Kumar, A.: Distributed deep learning on data systems: a comparative analysis of approaches. Proc. VLDB Endow. 14(10), 1769–1782 (2021)

Nakandala, S., Zhang, Y., Kumar, A.: Cerebro: a data system for optimized deep learning model selection. Proc. VLDB Endow. 13(11), 2159–2173 (2020)