Experimenting sensitivity-based anonymization framework in Apache Spark
Abstract
Privacy is one of the biggest concerns in big data and analytics. We believe that forthcoming frameworks and theories will provide several solutions for privacy protection. One known solution is k-anonymity, which was introduced for traditional data. Recently, two major frameworks, MapReduce and Spark, have advanced big data processing and applications. Spark has been attracting more attention because of its impact on a wide range of big data applications, among which data analytics and anonymization are predominant. We previously proposed an anonymization method that implements k-anonymity in the MapReduce processing framework. In this paper, we investigate Spark's performance in processing data anonymization. Spark is a fast processing framework that has been applied in several domains, such as SQL, multimedia, and data streams. Our focus is Spark SQL, which is adequate for big data anonymization. Since Spark operates in memory, we need to observe its limitations, speed, and fault tolerance as the data size increases, and to compare MapReduce with Spark in processing anonymity. Spark introduces an abstraction called resilient distributed datasets (RDDs), which reads and serializes a collection of objects partitioned across a set of machines. Developers claim that Spark can outperform MapReduce by 10 times in iterative machine learning jobs. Our experiments in this paper compare MapReduce and Spark. The overall results show better processing times for Spark in anonymity operations. However, in some limited cases, we prefer the older MapReduce framework, namely when cluster resources are limited and the network is not congested.
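To make the k-anonymity property concrete, the sketch below checks whether a small table satisfies it: every combination of quasi-identifier values must occur in at least k records. This is only an illustration of the definition in plain Python, not the sensitivity-based method or the Spark SQL implementation evaluated in the paper. The function name `is_k_anonymous`, the column names, and the sample records are hypothetical.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every quasi-identifier combination occurs in at least k records.

    records: list of dicts (hypothetical, already generalized rows).
    quasi_identifiers: column names treated as quasi-identifiers (an assumption).
    """
    # Count how many records share each combination of quasi-identifier values.
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    # The table is k-anonymous when no group is smaller than k.
    return all(count >= k for count in groups.values())


# Hypothetical generalized records: ages as ranges, ZIP codes partially masked.
records = [
    {"age": "30-39", "zip": "21**", "disease": "flu"},
    {"age": "30-39", "zip": "21**", "disease": "asthma"},
    {"age": "40-49", "zip": "22**", "disease": "flu"},
    {"age": "40-49", "zip": "22**", "disease": "diabetes"},
]

print(is_k_anonymous(records, ["age", "zip"], 2))  # True: each group has 2 records
print(is_k_anonymous(records, ["age", "zip"], 3))  # False: no group reaches 3 records
```

In a distributed setting the same check amounts to a group-by on the quasi-identifier columns followed by a minimum over the group counts, which is the kind of aggregation that both MapReduce and Spark SQL are designed to run in parallel.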