GPU-based efficient join algorithms on Hadoop

Springer Science and Business Media LLC - Volume 77 - Pages 292-321 - 2020

Hongzhi Wang¹, Ning Li¹, Zheng Wang¹, Jianing Li¹

¹Harbin Institute of Technology, Harbin, China

Abstract

Growing data volumes place heavy pressure on query processing and storage, so many studies have used GPUs to accelerate the join, one of the most important operations in modern database systems. However, existing GPU-accelerated join methods are poorly suited to joins on big data. This paper therefore combines Hadoop with GPUs to speed up nested loop join, hash join, and theta join; it is also the first work to use a GPU to accelerate theta join. After the data are pre-filtered and pre-processed with MapReduce and HDFS in Hadoop, the proposed approach can handle larger tables than existing GPU acceleration methods. MapReduce also lets the proposed algorithms estimate the number of results more accurately and allocate appropriately sized storage without unnecessary cost, which makes them more efficient. Experimental results show that our approach is 1.5–2 times faster than a GPU-based approach without Hadoop and 1.3–2 times faster than Hadoop-based approaches without a GPU.
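The idea of sizing the output before writing it can be illustrated with a generic two-pass ("count, then fill") theta join. This is a minimal CPU-only sketch in Python, not the paper's algorithm: the function name `theta_join_two_pass`, the `chunk_size` parameter, and the chunking scheme are assumptions made for illustration, and each chunk merely stands in for an independent map task or GPU thread block.

```python
from itertools import accumulate


def theta_join_two_pass(left, right, predicate, chunk_size=2):
    """Join two lists under an arbitrary predicate (hypothetical helper).

    Pass 1 counts the matches produced by each chunk of `left`.
    Pass 2 writes the matching pairs into an output buffer that was
    allocated at exactly the required size.
    """
    chunks = [left[i:i + chunk_size] for i in range(0, len(left), chunk_size)]

    # Pass 1: every chunk counts its matches without storing them.
    counts = [
        sum(1 for a in chunk for b in right if predicate(a, b))
        for chunk in chunks
    ]

    # An exclusive prefix sum gives each chunk its start offset in the output.
    offsets = [0] + list(accumulate(counts))[:-1]
    output = [None] * sum(counts)  # exact-size allocation, no over-provisioning

    # Pass 2: every chunk writes into its own disjoint slice of the output.
    for chunk, start in zip(chunks, offsets):
        pos = start
        for a in chunk:
            for b in right:
                if predicate(a, b):
                    output[pos] = (a, b)
                    pos += 1
    return output


if __name__ == "__main__":
    # Theta join with the non-equality condition a < b.
    print(theta_join_two_pass([1, 5, 3], [2, 4], lambda a, b: a < b))
```

Because the slices are disjoint, the chunks in the second pass could run in parallel without synchronizing on the output buffer; the price is that the predicate is evaluated twice.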

