SparkBench: a Spark benchmarking suite characterizing large-scale in-memory data analytics

Min Li¹, Jian Tan², Yandong Wang¹, Li Zhang¹, Valentina Salapura¹
¹ IBM Almaden Research Center, San Jose, USA
² Ohio State University, Columbus, USA
