HaRD: a heterogeneity-aware replica deletion for HDFS

Journal of Big Data - Volume 6 - Pages 1-21 - 2019
Hilmi Egemen Ciritoglu1, John Murphy1, Christina Thorpe2
1Performance Engineering Laboratory, School of Computer Science, University College Dublin, Dublin, Ireland
2Technological University Dublin, Dublin, Ireland

Abstract

The Hadoop distributed file system (HDFS) is responsible for storing very large datasets reliably on clusters of commodity machines. HDFS uses replication to serve data requested by clients with high throughput. Data replication is a trade-off between better data availability and higher disk usage. Recent studies propose data replication management frameworks that alter the replication factor of files dynamically in response to the popularity of the data, keeping more replicas of in-demand data to enhance the overall performance of the system. When data becomes less popular, these schemes reduce the replication factor, which alters the data distribution and leaves it unbalanced. Such an unbalanced data distribution causes hot spots, low data locality and excessive network usage in the cluster. In this work, we first confirm that reducing the replication factor causes unbalanced data distribution when using Hadoop's default replica deletion scheme. We then show that WBRD, the data-distribution-aware replica deletion scheme we proposed in previous work, performs sub-optimally on heterogeneous clusters even though it keeps the data distribution balanced. To overcome this issue, we propose a heterogeneity-aware replica deletion scheme (HaRD). HaRD considers the nodes' processing capabilities when deleting replicas and therefore stores more replicas on the more powerful nodes. We implemented HaRD on top of HDFS and conducted a performance evaluation on a 23-node dedicated heterogeneous cluster. Our results show that HaRD reduced execution time by up to 60% and 17% compared to Hadoop and WBRD, respectively.
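To make the core idea concrete, the following minimal Java sketch illustrates a heterogeneity-aware deletion choice: when a file's replication factor is lowered, replicas on the least capable nodes are dropped first, so more copies remain on the more powerful nodes. This is only an illustration under simple assumptions, not the published HaRD implementation; the ReplicaHolder type, the capabilityScore field and the node names are hypothetical.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public final class HeterogeneityAwareDeletionSketch {

    // A node currently holding one replica of the block under consideration.
    // capabilityScore is a hypothetical per-node value (e.g. derived from CPU cores and memory).
    record ReplicaHolder(String nodeName, double capabilityScore) { }

    // Returns the holders whose replicas should be removed when the
    // replication factor is reduced to targetReplication.
    static List<ReplicaHolder> selectReplicasToDelete(List<ReplicaHolder> holders,
                                                      int targetReplication) {
        if (targetReplication >= holders.size()) {
            return List.of(); // nothing to delete
        }
        List<ReplicaHolder> sorted = new ArrayList<>(holders);
        // Least capable nodes first: these become the deletion candidates.
        sorted.sort(Comparator.comparingDouble(ReplicaHolder::capabilityScore));
        return sorted.subList(0, holders.size() - targetReplication);
    }

    public static void main(String[] args) {
        List<ReplicaHolder> holders = List.of(
                new ReplicaHolder("node-a", 1.0),   // small node
                new ReplicaHolder("node-b", 2.5),   // mid-range node
                new ReplicaHolder("node-c", 4.0));  // powerful node
        // Reducing the replication factor from 3 to 2 removes the replica
        // on the least capable node (node-a), so the stronger nodes keep theirs.
        selectReplicasToDelete(holders, 2)
                .forEach(h -> System.out.println("delete replica on " + h.nodeName()));
    }
}

In contrast, Hadoop's default scheme and WBRD do not weight this choice by node capability; the sketch simply shows why a capability-weighted selection concentrates replicas, and hence local tasks, on the faster machines of a heterogeneous cluster.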
