Using Big Data-machine learning models for diabetes prediction and flight delays analytics

Journal of Big Data - Tập 7 - Trang 1-18 - 2020
Thérence Nibareke1, Jalal Laassiri1
1Informatics Systems and Optimization Laboratory, Ibn Tofail University, Kenitra, Morocco

Tóm tắt

Nowadays large data volumes are daily generated at a high rate. Data from health system, social network, financial, government, marketing, bank transactions as well as the censors and smart devices are increasing. The tools and models have to be optimized. In this paper we applied and compared Machine Learning algorithms (Linear Regression, Naïve bayes, Decision Tree) to predict diabetes. Further more, we performed analytics on flight delays. The main contribution of this paper is to give an overview of Big Data tools and machine learning models. We highlight some metrics that allow us to choose a more accurate model. We predict diabetes disease using three machine learning models and then compared their performance. Further more we analyzed flight delay and produced a dashboard which can help managers of flight companies to have a 360° view of their flights and take strategic decisions. We applied three Machine Learning algorithms for predicting diabetes and we compared the performance to see what model give the best results. We performed analytics on flights datasets to help decision making and predict flight delays. The experiment shows that the Linear Regression, Naive Bayesian and Decision Tree give the same accuracy (0.766) but Decision Tree outperforms the two other models with the greatest score (1) and the smallest error (0). For the flight delays analytics, the model could show for example the airport that recorded the most flight delays. Several tools and machine learning models to deal with big data analytics have been discussed in this paper. We concluded that for the same datasets, we have to carefully choose the model to use in prediction. In our future works, we will test different models in other fields (climate, banking, insurance.).

Tài liệu tham khảo

Inoubli W, Aridhi S, Mezni H, Maddouri M, Mephu Nguifo E. An experimental survey on big data frameworks. Future Gener Comput Syst. 2018;86:546–64. Petrov M, Butakov N, Nasonov D, Melnik M. Adaptive performance model for dynamic scaling Apache Spark Streaming. Procedia Comput Sci. 2018;136:109–17. Brahmwar M, Kumar M, Sikka G. Tolhit—a scheduling algorithm for Hadoop Cluster. Procedia Comput Sci. 2016;89:203–8. Al-Saqqa S, Al-Naymat G, Awajan A. A large-scale sentiment data classification for online reviews under apache spark. Procedia Comput Sci. 2018;141:183–9. Zheng W, Qin Y, Bugingo E, Zhang D, Chen J. Cost optimization for deadline-aware scheduling of big-data processing jobs on clouds. Future Gener Comput Syst. 2018;82:244–55. Akhavan-Hejazi H, Mohsenian-Rad H. Power systems big data analytics: an assessment of paradigm shift barriers and prospects. Energy Rep. 2018;4:91–100. Uzunkaya C, Ensari T, Kavurucu Y. Hadoop ecosystem and its analysis on tweets. Procedia Soc Behav Sci. 2015;195:1890–7. Naik NS, Negi A, Anitha R. A data locality based scheduler to enhance MapReduce performance in heterogeneous environments. Future Gener Comput Syst. 2019;90:423–34. Sarumi OA, Leung CK, Adetunmbi AO. Spark-based data analytics of sequence motifs in large omics data. Procedia Comput Sci. 2018;126:596–605. Hernández ÁB, Perez MS, Gupta S, Muntés-Mulero V. Using machine learning to optimize parallelism in big data applications. Future Gener Comput Syst. 2018;86:1076–92. Hidalgo N, Rosas E, Vasquez C, Wladdimiro D. Measuring stream processing systems adaptability under dynamic workloads. Future Gener Comput Syst. 2018;88:413–23. Lu S, Wei X, Rao B, Tak B, Wang L, Wang L. LADRA: log-based abnormal task detection and root-cause analysis in big data processing with Spark. Future Gener Comput Syst. 2019;95:392–403. JayaLakshmi ANM, Krishna Kishore KV. Performance evaluation of DNN with other machine learning techniques in a cluster using Apache Spark and MLlib. J King Saud Univ Comput Inf Sci. 2018. https://doi.org/10.1016/j.jksuci.2018.09.022. Mahdavinejad MS, Rezvan M, Barekatain M, Adibi P, Barnaghi P, Sheth AP. Machine learning for internet of things data analysis: a survey. Digit Commun Netw. 2018;4(3):161–75. Rao Chandakanna V. REHDFS: a random read/write enhanced HDFS. J Netw Comput Appl. 2018;103:85–100. Landset S, Khoshgoftaar TM, Richter AN, Hasanin T. A survey of open source tools for machine learning with big data in the Hadoop ecosystem. J Big Data. 2(1). 2015. http://www.journalofbigdata.com/content/2/1/24. Subramaniyaswamy V, Vijayakumar V, Logesh R, Indragandhi V. Unstructured data analysis on big data using map reduce. Procedia Comput Sci. 2015;50:456–65. Raj P. The Hadoop ecosystem technologies and tools. In: Advances in computers, vol. 109. Elsevier; 2018. pp. 279–320. Mustafa S, Elghandour I, Ismail MA. A machine learning approach for predicting execution time of spark jobs. Alex Eng J. 2018;57(4):3767–78. Chambers B, Zaharia M. Spark: The definitive guide; 2018. p. 600. Carcillo F, Dal Pozzolo A, Le Borgne Y-A, Caelen O, Mazzer Y, Bontempi G. SCARFF: a scalable framework for streaming credit card fraud detection with spark. Inf Fusion. 2018;41:182–94. McDonald C. Getting started with Apache Spark from inception to production; 2018. p. 174. Garcia-Ceja E, Riegler M, Nordgreen T, Jakobsen P, Oedegaard KJ, Tørresen J. Mental health monitoring with multimodal sensing and machine learning: a survey. Pervasive Mob Comput. 2018;51:1–26. Sneha N, Gangil T. Analysis of diabetes mellitus for early prediction using optimal features selection. J Big Data. 2019;6(1):13. https://doi.org/10.1186/s40537-019-0175-6. Jayanthi N, Babu BV, Rao NS. Survey on clinical prediction models for diabetes prediction. J Big Data. 2017;4(1):26. https://doi.org/10.1186/s40537-017-0082-7. Farooq K, Hussain A. A novel ontology and machine learning driven hybrid cardiovascular clinical prognosis as a complex adaptive clinical system. Complex Adapt Syst Model. 2016;4(1):12. https://doi.org/10.1186/s40294-016-0023-x. Sternberg A, Soares J, Carvalho D, et al. A review on flight delay prediction. 2017. arXiv preprint arXiv:1703.06118. https://arxiv.org/abs/1703.06118. Chen J, Li M. Chained predictions of flight delay using machine learning. In: AIAA Scitech 2019 Forum. 2019. p. 1661. https://www.researchgate.net/publication/330185077. Zettam M, Laassiri J, Enneya N. A MapReduce-based Adjoint method for preventing brain disease. J Big Data. 2018. https://doi.org/10.1186/s40537-018-0136-5. Al-Zuabi IM, Jafar A, Aljoumaa K. Predicting customer’s gender and age depending on mobile phone data. J Big Data. 2019. https://doi.org/10.1186/s40537-019-0180-9. Dahdouh K, Dakkak A, Oughdir L, Ibriz A. Large-scale e-learning recommender system based on Spark and Hadoop. J Big Data. 2019. https://doi.org/10.1186/s40537-019-0169-4. Ed-daoudy A, Maalmi K. A new Internet of Things architecture for real-time prediction of various diseases using machine learning on big data environment. J Big Data. 2019;6(1):104. https://doi.org/10.1186/s40537-019-0271-7. Hosseinzadeh F, Kayvanjoo AH, Ebrahimi M, et al. Prediction of lung tumor types based on protein attributes by machine learning algorithms. SpringerPlus. 2013;2(1):238. Behera M, Fowler EE, Owonikoko TK, et al. Statistical learning methods as a preprocessing step for survival analysis: evaluation of concept using lung cancer data. Biomed Eng Online. 2011;10(1):97. Chakrabarty N. A data mining approach to flight arrival delay prediction for american airlines. 2019. arXiv preprint arXiv:1903.06740.