Recursive join processing in big data environment

Journal of Computer Science and Cybernetics - Vol. 37, No. 2 - pp. 107-122 - 2021
Anh-Cang Phan¹, Thanh-Ngoan Trieu², Thuong-Cang Phan²
¹ Vinh Long University of Technology and Education, 73 Nguyen Hue Street, Ward 2, Vinh Long City, Vinh Long Province, Viet Nam
² Can Tho University, 3/2 Street, Ninh Kieu District, Can Tho City, Viet Nam

Abstract

In the era of information explosion, big data is receiving increased attention because of its important implications for the growth, profitability, and survival of modern organizations. However, it also poses many challenges in how data is processed and queried over time. The join is one of the most common operations in data queries. In particular, the recursive join is a join type used to query hierarchical data, but it is considerably more complex and costly than an ordinary join. Evaluating a recursive join in MapReduce requires a number of iterations, each consisting of two tasks: a join task and an incremental computation task. These tasks are expensive and degrade query performance on large datasets because they generate a large amount of intermediate data that must be transmitted over the network. In this study, we therefore propose a simple but efficient approach to recursive joins on big data that halves the number of required iterations in the Spark environment. This improvement significantly reduces the number of required tasks as well as the amount of intermediate data generated and transferred over the network. Our experimental results show that the improved recursive join is more efficient and faster than the traditional one on large-scale datasets.
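To make the iteration structure concrete, the following is a minimal sketch of a recursive join computed as a transitive closure over an edge relation. It is not the authors' implementation. Plain Python sets stand in for distributed Spark datasets, and all function names are hypothetical. The first function runs one join task and one incremental computation task per iteration. The second is an assumed way of halving the iteration count, suggested by the keyword "three-way join": each iteration joins the newly found tuples with the base relation twice, so it advances two hops at a time. The abstract does not specify the actual mechanism.

```python
def _index_by_source(edges):
    """Build a lookup from source node to its set of target nodes."""
    by_src = {}
    for src, dst in edges:
        by_src.setdefault(src, set()).add(dst)
    return by_src


def recursive_join_one_hop(edges):
    """Semi-naive evaluation: one join with the base relation per iteration."""
    by_src = _index_by_source(edges)
    closure = set(edges)
    delta = set(edges)          # tuples discovered in the previous iteration
    iterations = 0
    while delta:
        iterations += 1
        # Join task: delta(x, y) joined with edges(y, z) gives (x, z).
        joined = {(x, z) for (x, y) in delta for z in by_src.get(y, ())}
        # Incremental computation task: keep only tuples not seen before.
        delta = joined - closure
        closure |= delta
    return closure, iterations


def recursive_join_two_hop(edges):
    """Assumed variant: join delta with the base relation twice per iteration."""
    by_src = _index_by_source(edges)
    closure = set(edges)
    delta = set(edges)
    iterations = 0
    while delta:
        iterations += 1
        # Three-way join delta x edges x edges, evaluated as two chained joins.
        one_hop = {(x, z) for (x, y) in delta for z in by_src.get(y, ())}
        two_hop = {(x, w) for (x, z) in one_hop for w in by_src.get(z, ())}
        # Incremental computation task over the results of both hops.
        delta = (one_hop | two_hop) - closure
        closure |= delta
    return closure, iterations


if __name__ == "__main__":
    chain = {(1, 2), (2, 3), (3, 4), (4, 5), (5, 6)}
    closure_a, iters_a = recursive_join_one_hop(chain)
    closure_b, iters_b = recursive_join_two_hop(chain)
    print(len(closure_a), iters_a)
    print(len(closure_b), iters_b)
```

On the six-node chain used in the usage example, both functions return the same 15 tuples, but the one-hop version needs 5 iterations while the two-hop version needs 3. Each two-hop iteration does more join work, so whether it pays off in a distributed setting depends on how much per-iteration overhead and intermediate data transfer it avoids.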

Keywords

#Apache Spark #big data #recursive join #optimize three-way join
