A Survey on Data-driven Performance Tuning for Big Data Analytics Platforms

Big Data Research - Volume 25 - Article 100206 - 2021
Rogério Luís de C. Costa (1,2), José Moreira (2,3), Paulo Pintor (3), Veronica dos Santos (4), Sérgio Lifschitz (4)
(1) Computer Science and Communication Research Centre (CIIC), Polytechnic of Leiria - Leiria - 2411-901, Portugal
(2) Institute of Electronics and Informatics Engineering (IEETA), University of Aveiro - Aveiro - 3810-193, Portugal
(3) Department of Electronics, Telecommunications and Informatics (DETI), University of Aveiro - Aveiro - 3810-193, Portugal
(4) Departamento de Informática, Pontifícia Universidade Católica do Rio de Janeiro (PUC-Rio), RJ - 22451-900, Brazil
