Privacy-Preserving Data Sharing by Integrating Perturbed Distance Matrices
Abstract
Collecting large amounts of data helps machine learning produce less biased models. In many cases, however, similar data are distributed among organizations, and integrating them is difficult because of privacy and cost concerns. Integrating such distributed data without handing over the original data leads to the concept of data collaboration, which combines data held by different organizations in a secure manner. We propose a method in which each organization shares a distance matrix of its original data, computed with respect to data common to all organizations, so that neighbor information of the original data can be learned. Specifically, the proposed method robustly integrates distributed data, and the integrated data are of as good quality as the concatenated raw data, even when the amount of data in each organization is small and the data bias is large. In addition, the proposed method is applicable to data contaminated by noise. To demonstrate its effectiveness, we performed a classification task on open biological data divided into several pieces and found that the classification results for the divided data were as precise as when all data were available. Finally, we show that, as a by-product, the robustness of the method against noise improves the anonymity of the original data.
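The idea of sharing distances instead of raw records can be illustrated with a small sketch. This is not the authors' algorithm: it assumes, for illustration only, that each organization computes Euclidean distances from its private rows to a set of common anchor points, adds Gaussian noise, and shares only the resulting matrix, and that a plain k-nearest-neighbor vote is then run on the stacked distance features. The names `distance_to_anchors`, `knn_predict`, `noise_scale`, and the synthetic data are all hypothetical.

```python
import numpy as np

def distance_to_anchors(X, anchors, noise_scale=0.0, rng=None):
    """Euclidean distance from each row of X to each anchor, optionally perturbed."""
    diff = X[:, None, :] - anchors[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=2))
    if noise_scale > 0:
        rng = np.random.default_rng() if rng is None else rng
        # Assumption: additive Gaussian noise stands in for the perturbation.
        D = D + rng.normal(0.0, noise_scale, size=D.shape)
    return D

def knn_predict(train_feats, train_labels, query_feats, k=3):
    """Majority vote among the k nearest training rows in feature space."""
    diff = query_feats[:, None, :] - train_feats[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    nearest = np.argsort(dist, axis=1)[:, :k]
    votes = train_labels[nearest]
    return np.array([np.bincount(row).argmax() for row in votes])

rng = np.random.default_rng(0)
anchors = rng.normal(size=(5, 4))  # common data known to every organization

def make_private_data(rng):
    """Synthetic two-class data held by one organization (illustrative only)."""
    X = np.vstack([rng.normal(0.0, 0.1, (10, 4)), rng.normal(5.0, 0.1, (10, 4))])
    y = np.array([0] * 10 + [1] * 10)
    return X, y

X_a, y_a = make_private_data(rng)
X_b, y_b = make_private_data(rng)

# Each organization shares only its perturbed distance matrix and labels.
D_a = distance_to_anchors(X_a, anchors, noise_scale=0.01, rng=rng)
D_b = distance_to_anchors(X_b, anchors, noise_scale=0.01, rng=rng)
shared_feats = np.vstack([D_a, D_b])
shared_labels = np.concatenate([y_a, y_b])

# New records are mapped to the same distance features before classification.
queries = np.vstack([np.zeros(4), np.full(4, 5.0)])
query_feats = distance_to_anchors(queries, anchors)
predictions = knn_predict(shared_feats, shared_labels, query_feats, k=3)
print(predictions)
```

In this sketch the integrating party never sees `X_a` or `X_b`, only distances to the common anchors. How much such distances reveal about the raw records depends on the number of anchors and the noise level, which this toy example does not analyze.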