Improving lookup and query execution performance in distributed Big Data systems using Cuckoo Filter

Journal of Big Data - Tập 9 - Trang 1-30 - 2022
Sharafat Ibn Mollah Mosharraf1, Muhammad Abdullah Adnan1,2
1Department of Computer Science & Engineering, Bangladesh University of Engineering & Technology (BUET), Dhaka, Bangladesh
2Department of Computer Science & Engineering, University of California San Diego, California, USA

Tóm tắt

Performance is a critical concern when reading and writing data from billions of records stored in a Big Data warehouse. We introduce two scopes for query performance improvement. One is to improve the performance of lookup queries after data deletion in Big Data systems that use Eventual Consistency. We propose a scheme to improve lookup performance after data deletion by using Cuckoo Filter. Another scope for improvement is to avoid unnecessary network round-trips for querying in remote nodes in a distributed Big Data cluster when it is known that the nodes do not have requested partition of data. We propose a scheme using probabilistic filters that are looked up before querying remote nodes so that queries resulting in no data can be skipped from passing through the network. We evaluate our schemes with Cassandra using real dataset and show that each scheme can improve performance of lookup queries for up to 2x.

Tài liệu tham khảo

Vagata P, Wilfong K. Scaling the Facebook data warehouse to 300PB (April 2014). https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb Accessed 30 Dec 2019. Dittrich J, Quiané-Ruiz J-A, Jindal A, Kargin Y, Setty V, Schad J. Hadoop++: Making a yellow elephant run like a cheetah (without it even noticing). Proc VLDB Endow. 2010;3(1–2):515–29. https://doi.org/10.14778/1920841.1920908. Richter S, Quiané-Ruiz J-A, Schuh S, Dittrich J. Towards zero-overhead static and Adaptive Indexing in Hadoop. VLDB J. 2014;23(3):469–94. https://doi.org/10.1007/s00778-013-0332-z. Schuhknecht FM, Dittrich J, Linden L. Adaptive Adaptive Indexing. In: ICDE; 2018. Bloom BH. Space/time trade-offs in hash coding with allowable errors. Commun ACM. 1970;13(7):422–6. https://doi.org/10.1145/362686.362692. Herodotou H, Babu S. Profiling, what-if analysis, and cost-based optimization of MapReduce programs. PVLDB. 2011;4(11):1111–22. Jahani E, Cafarella MJ, Ré C. Automatic optimization for MapReduce programs. PVLDB. 2011;4(6):385–96. Todorov Marinov M. A bloom filter application for processing big datasets through mapreduce framework. In: 2021 International Conference on Information Technologies (InfoTech), pp. 1–5;2021. https://doi.org/10.1109/InfoTech52438.2021.9548638. Nehme R, Bruno N. Automated partitioning design in Parallel Database Systems. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data. SIGMOD ’11, pp. 1137–1148. ACM, New York, NY, USA;2011. https://doi.org/10.1145/1989323.1989444. Ameloot TJ, Geck G, Ketsman B, Neven F, Schwentick T. Data partitioning for single-round multi-join evaluation in massively parallel systems. SIGMOD Rec. 2016;45(1):33–40. https://doi.org/10.1145/2949741.2949750. Lu Y, Shanbhag A, Jindal A, Madden S. AdaptDB: Adaptive partitioning for distributed joins. Proc VLDB Endow. 2017;10(5):589–600. https://doi.org/10.14778/3055540.3055551. Chang F, Dean J, Ghemawat S, Hsieh WC, Wallach DA, Burrows M, Chandra T, Fikes A, Gruber RE. Bigtable: A distributed storage system for structured data. ACM Trans Comput Syst. 2008;26(2):4–1426. https://doi.org/10.1145/1365815.1365816. Bhushan M, Banerjea S, Yadav SK. Bloom filter based optimization on HBase with MapReduce. In: Data Mining and Intelligent Computing, IEEE, pp. 1–5;2014. https://doi.org/10.1109/ICDMIC.2014.6954230. Lakshman A, Malik P. Cassandra: a decentralized structured storage system. ACM SIGOPS Oper Syst Rev. 2010;44(2):35–40. Fan B, Andersen DG, Kaminsky M, Mitzenmacher MD. Cuckoo Filter: Practically better than Bloom. In: Proceedings of the 10th ACM International on Conference on Emerging Networking Experiments and Technologies. CoNEXT ’14, pp. 75–88. ACM, New York, NY, USA;2014. https://doi.org/10.1145/2674005.2674994. Frazier J. Why Data Deletion Makes Sense (and Dollars). http://www.ftijournal.com/article/why-data-deletion-makes-sense-and-dollars Accessed 15 Aug 2018. Lee J-G, Han J, Li X, Gonzalez H. TraClass: Trajectory classification using hierarchical region-based and trajectory-based clustering. Proc VLDB Endow. 2008;1(1):1081–94. https://doi.org/10.14778/1453856.1453972. Zamanian E, Binnig C, Salama A. Locality-aware partitioning in parallel database systems. In: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. SIGMOD ’15, pp. 17–30. ACM, New York, NY, USA 2015. https://doi.org/10.1145/2723372.2723718. Hentschel B, Kester MS, Idreos S. Column sketches: A scan accelerator for rapid and robust predicate evaluation. In: Proceedings of the 2018 International Conference on Management of Data. SIGMOD ’18, pp. 857–872. ACM, New York, NY, USA;2018. https://doi.org/10.1145/3183713.3196911. http://doi.acm.org/10.1145/3183713.3196911. Sidirourgos L, Kersten M. Column imprints: A secondary index structure. In: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data. SIGMOD ’13, pp. 893–904. ACM, New York, NY, USA;2013. https://doi.org/10.1145/2463676.2465306. http://doi.acm.org/10.1145/2463676.2465306. Tian Y, Zou T, Ozcan F, Goncalves R, Pirahesh H. Joins for hybrid warehouses: Exploiting massive parallelism in hadoop and enterprise data warehouses. In: Proceedings of the 18th International Conference on Extending Database Technology, EDBT 2015, Brussels, Belgium, March 23-27, 2015., pp. 373–384;2015. https://doi.org/10.5441/002/edbt.2015.33. Polychroniou O, Sen R, Ross KA. Track join: Distributed joins with minimal network traffic. In: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. SIGMOD ’14, pp. 1483–1494. ACM, New York, NY, USA;2014. https://doi.org/10.1145/2588555.2610521. Vogels W. Eventually consistent. Commun ACM. 2009;52(1):40–4. He Y, Lee R, Huai Y, Shao Z, Jain N, Zhang X, Xu Z. RCFile: A fast and space-efficient data placement structure in MapReduce-based warehouse systems. In: 2011 IEEE 27th International Conference on Data Engineering, pp. 1199–1208;2011. https://doi.org/10.1109/ICDE.2011.5767933. Chen S. Cheetah: a high performance, custom data warehouse on top of MapReduce. Proc VLDB Endow. 2010;3(1–2):1459–68. https://doi.org/10.14778/1920841.1921020. Tarkoma S, Rothenberg CE, Lagerspetz E. Theory and practice of Bloom filters for distributed systems. IEEE Commun Surv Tutorials. 2012;14(1):131–55. https://doi.org/10.1109/SURV.2011.031611.00024. Liang Y, Yu Y, Ouyang W. Improved big data filtering algorithm based on bloom filter. J Phys. 2020;1629(1):012026. https://doi.org/10.1088/1742-6596/1629/1/012026. Kwon M, Reviriego P, Pontarelli S. A length-aware Cuckoo filter for faster IP lookup. In: 2016 IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), pp. 1071–1072;2016. https://doi.org/10.1109/INFCOMW.2016.7562258. Cui J, Zhang J, Zhong H, Xu Y. SPACF: a secure privacy-preserving authentication scheme for VANET with Cuckoo filter. IEEE Trans Vehicular Technol. 2017;66(11):10283–95. https://doi.org/10.1109/TVT.2017.2718101. Mahale VV, Pareek NP, Uttarwar VU. Alleviation of DDoS attack using advance technique. In: 2017 International Conference on Innovative Mechanisms for Industry Applications (ICIMIA), pp. 172–176;2017. https://doi.org/10.1109/ICIMIA.2017.7975595. Al-Hisnawi M, Ahmadi M. Deep packet inspection using Cuckoo filter. In: 2017 Annual Conference on New Trends in Information Communications Technology Applications (NTICT), pp. 197–202;2017. https://doi.org/10.1109/NTICT.2017.7976111. Kwon M, Vajpayee S, Vijayaragavan P, Dhuliya A, Marshall J. Use of Cuckoo filters with FD.Io VPP for software IPv6 routing lookup. In: Proceedings of the SIGCOMM Posters and Demos, pp. 127–129. ACM, ???;2017. https://doi.org/10.1145/3123878.3132010. Ren K, Zheng Q, Arulraj J, Gibson G. SlimDB: A space-efficient key-value storage engine for semi-sorted data. Proc VLDB Endow. 2017;10(13):2037–48. https://doi.org/10.14778/3151106.3151108. Xue Q, Chuah MC. Cuckoo-filter based privacy-aware search over encrypted cloud data. In: 2015 11th International Conference on Mobile Ad-hoc and Sensor Networks (MSN), pp. 60–68;2015. https://doi.org/10.1109/MSN.2015.41. Chen H, Liao L, Jin H, Wu J. The Dynamic Cuckoo Filter. In: 2017 IEEE 25th International Conference on Network Protocols (ICNP), pp. 1–10;2017. https://doi.org/10.1109/ICNP.2017.8117563. Sun Y, Hua Y, Jiang S, Li Q, Cao S, Zuo P. SmartCuckoo: A fast and cost-efficient hashing index scheme for cloud storage systems. In: Proceedings of the 2017 USENIX Conference on Usenix Annual Technical Conference. USENIX ATC ’17, pp. 553–565. USENIX Association, Berkeley, CA, USA;2017. Kirsch A, Mitzenmacher M, Wieder U. More robust hashing: Cuckoo hashing with a stash. SIAM J Comput. 2009;39(4):1543–61. https://doi.org/10.1137/080728743. Mitzenmacher M, Pontarelli S, Reviriego P. Adaptive Cuckoo filters. In: Proceedings of the Twentieth Workshop on Algorithm Engineering and Experiments, pp. 36–47;2018. https://doi.org/10.1137/1.9781611975055.4. Casares J. Multi-datacenter Replication in Cassandra. (November 2012). https://www.datastax.com/blog/2012/11/multi-datacenter-replication-cassandra Accessed Dec 30;2019.