iiHadoop: an asynchronous distributed framework for incremental iterative computations
Tóm tắt
It is true that data is never static; it keeps growing and changing over time. New data is added and old data can either be modified or deleted. This incremental nature of data motivates the development of new systems to perform large-scale data computations incrementally. MapReduce was recently introduced to provide an efficient approach for handling large-scale data computations. Nevertheless, it turned to be inefficient in supporting the processing of small incremental data. While many previous systems have extended MapReduce to perform iterative or incremental computations, these systems are still inefficient and too expensive to perform large-scale iterative computations on changing data. In this paper, we present a new system called iiHadoop, an extension of Hadoop framework, optimized for incremental iterative computations. iiHadoop accelerates program execution by performing the incremental computations on the small fraction of data that is affected by changes rather than the whole data. In addition, iiHadoop improves the performance by executing iterations asynchronously, and employing locality-aware scheduling for the map and reduce tasks taking into account the incremental and iterative behavior. An evaluation for the proposed iiHadoop framework is presented using examples of iterative algorithms, and the results showed significant performance improvements over comparable existing frameworks.
Tài liệu tham khảo
Battré D, Ewen S, Hueske F, Kao O, Markl V, Warneke D. Nephele/pacts: a programming model and execution framework for web-scale analytical processing. Proceedings of the 1st ACM Symposium on cloud computing. New York: ACM; 2010. p. 119–30.
Dean J, Ghemawat S. Mapreduce: simplified data processing on large clusters. Commun ACM. 2008;51(1):107–13.
Isard M, Budiui M, Yu Y, Birrell A, Fetterly D. Dryad: distributed data-parallel programs from sequential building blocks. ACM SIGOPS Op Syst Rev. 2007;41(3):59–72.
Malewicz G, Austern MH, Bik AJ, Dehnert JC, Horn I, Leiser N, Czajkowski G. Pregel: a system for large-scale graph processing. Proceedings of the 2010 ACM SIGMOD International Conference on management of data. New York: ACM; 2010. p. 135–46.
Yu Y, Isard M, Fetterly D, Budiu M, Erlingsson Ú, Gunda PK, Currey J. Dryadlinq: a system for general-purpose distributed data-parallel computing using a high-level language. Proceedings of the 8th USENIX Conference on operating systems design and implementation. Berkeley: USENIX Association; 2008. p. 1–14.
Zaharia M, Chowdhury M, Franklin MJ, Shenker S, Stoica I. Spark: cluster computing with working sets. Proceedings of the 2Nd USENIX Conference on hot topics in cloud computing. Berkeley: USENIX Association; 2010. p. 10.
Apache Hadoop. http://hadoop.apache.org/. Accessed 26 Dec 2016.
Brin S, Page L. The anatomy of a large-scale hypertextual web search engine. Comput Netw ISDN Syst. 1998;30(1–7):107–17.
Chu C, Kim SK, Lin YA, Yu Y, Bradski G, Ng AY, Olukotun K. Map-reduce for machine learning on multicore. Proceedings of the 19th International Conference on neural information processing systems. Cambridge: MIT Press; 2006. p. 281–8.
Cormen TH, Leiserso CE, Rivest RL, Stein C. Introduction to algorithms. 3rd ed. Cambridge: MIT Press; 2001.
Peng D, Dabek F. Large-scale incremental processing using distributed transactions and notifications. Proceedings of the 9th USENIX Conference on operating systems design and implementation. Berkeley: USENIX Association; 2010. p. 251–64.
Murray DG, McSherry F, Isaacs R, Isard M, Barham P, Abadi M. Naiad: a timely dataflow system. Proceedings of the Twenty-Fourth ACM Symposium on operating systems principles. New York: ACM; 2013. p. 439–55.
Bhatotia P, Wieder A, Rodrigues R, Acar UA, Pasquin R. Incoop: Mapreduce for incremental computations. In: Proceedings of the 2nd ACM Symposium on cloud computing. New York: ACM; 2011. p. 7:1–7:14.
Yan C, Yang X, Yu Z, Li M, Li X. Incmr: Incremental data processing based on mapreduce. In: 2012 IEEE Fifth International Conference on cloud computing. New York: IEEE; 2012. p. 534–41.
Zhang Y, Chen S. i2mapreduce: Incremental iterative mapreduce. In: Proceedings of the 2Nd International Workshop on cloud intelligence. New York: ACM; 2013. p. 3:1–3:4.
Zhang Y, Chen S, Wang Q, Yu G. i2mapreduce: incremental mapreduce for mining evolving big data. IEEE Trans Knowl Data Eng. 2015;27(7):1906–19.
Borthaku D. The hadoop distributed file system: architecture and design. Hadoop Project Website. 2007;11:1–14.
Vavilapalli VK, Murthy AC, Douglas C, Agarwal S, Konar M, Evans R, Graves T, Lowe J, Shah H, Seth S, Saha B, Curino C, O’Malley O, Radia S, Reed B, Baldeschwieler E. Apache hadoop yarn: yet another resource negotiator. In: Proceedings of the 4th Annual Symposium on cloud computing. ACM: New York; 2013. p. 5:1–5:16.
Iteration. https://en.wikipedia.org/wiki/Iteration. Accessed 2 Dec 2016.
Ekanayake J, Li H, Zhang B, Gunarathne T, Bae SH, Qiu J, Fox G. Twister: a runtime for iterative mapreduce. Proceedings of the 19th ACM International Symposium on high performance distributed computing. New York: ACM; 2010. p. 810–8.
Zaharia M, Chowdhury M, Das T, Dave A, Ma J, McCauley M, Franklin MJ, Shenker S, Stoica I. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. Proceedings of the 9th USENIX Conference on networked systems design and implementation. Berkeley: USENIX Association; 2012. p. 2.
Bu Y, Howe B, Balazinska M, Ernst MD. The haloop approach to large-scale iterative data analysis. VLDB J. 2012;21(2):169–90.
Zhang Y, Gao Q, Gao L, Wang C. imapreduce: a distributed computing framework for iterative computation. J Grid Comput. 2012;10(1):47–68.
Elnikety E, Elsayed T, Ramadan HE. ihadoop: asynchronous iterations for mapreduce. Proceedings of the 2011 IEEE Third International Conference on cloud computing technology and science. Washington: IEEE; 2011. p. 81–90.
Zhang Y, Gao Q, Gao L, Wang C. Accelerate large-scale iterative computation through asynchronous accumulative updates. Proceedings of the 3rd Workshop on scientific cloud computing date. New York: ACM; 2012. p. 13–22.
Power R, Li J. Piccolo: building fast, distributed programs with partitioned tables. Proceedings of the 9th USENIX Conference on operating systems design and implementation. Berkeley: USENIX Association; 2010. p. 293–306.
Ewen S, Tzoumas K, Kaufmann M, Markl V. Spinning fast iterative data flows. Proc VLDB Endow. 2012;5(11):1268–79.
Li B, Mazur E, Diao Y, McGregor A, Shenoy P. A platform for scalable one-pass analytics using mapreduce. Proceedings of the 2011 ACM SIGMOD International Conference on management of data. New York: ACM; 2011. p. 985–96.
Condie T, Conway N, Alvaro P, Hellerstein JM, Elmeleegy K, Sears R. Mapreduce online. Proceedings of the 7th USENIX Conference on networked systems design and implementation. Berkeley: USENIX Association; 2010. p. 21–21.
Schildgen J, Jörg T, Hoffmann M, Deßloch S. Marimba: a framework for making mapreduce jobs incremental. In: Proceedings of the 2014 IEEE International Congress on big data. New York: IEEE; 2014. p. 128–35.
Zaharia M, Borthaku D, Sarma JS, Elmeleegy K, Shenker S, Stoica I. Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling. Proceedings of the 5th European Conference on computer systems. New York: ACM; 2010. p. 265–78.
Tang Z, Jiang L, Zhou J, Li K, Li K. A self-adaptive scheduling algorithm for reduce start time. Future Gener Comput Syst. 2015;43(C):51–60.
Langville AN, Meyer CD. Deeper inside pagerank. Internet Math. 2004;1(3):335–80.
Sneyers J, Schrijvers T, Demoen B. Dijkstra’s algorithm with fibonacci heaps: an executable description in chr. In: Proceedings of the 20th Workshop on logic programming; 2006. p. 182–91.
Kang U, Tsourakakis CE, Faloutsos C. Pegasus: a peta-scale graph mining system implementation and observations. Proceedings of the 2009 Ninth IEEE International Conference on data mining. Washington: IEEE; 2009. p. 229–38.
Jain AK, Murty MN, Flynn PJ. Data clustering: a review. ACM Comput Surv. 1999;31(3):264–323.
Xu R, Wunsch D. Survey of clustering algorithms. IEEE Trans Neural Netw. 2005;16(3):645–78.
ClueWeb09 Dataset. http://www.lemurproject.org/clueweb09/webGraph.php.
BigCross Dataset. http://www-old.cs.uni-paderborn.de/en/research-group/ag-bloemer/research/abgeschlossene/clustering/streamkmpp.html.