Set similarity join on massive probabilistic data using MapReduce
Tóm tắt
In this paper, we focus on set similarity join on massive probabilistic data using MapReduce, there is no effective approach that can process this problem efficiently. MapReduce is a popular paradigm that can process large volume data more efficiently, in this paper, we proposed two approaches using MapReduce to deal with this task: Hadoop Join by Map Side Pruning and Hadoop Join by Reduce Side Pruning. Hadoop Join by Map Side Pruning uses the sum of the existence probability to filter out the probabilistic sets directly at the Map task side which have no any chance to be similar with any other probabilistic set. Hadoop Join by Reduce Side Pruning uses probability sum based pruning principle and probability upper bound based pruning principle to reduce the candidate pairs at Reduce task side, it can save the comparison cost. Based on the above approaches, we proposed a hybrid solution that employs both Map-side and Reduce-side pruning methods. Finally we implemented the above approaches on Hadoop-0.20.2 and performed comprehensive experiments to their performance, we also test the speedup ratio compared with the naive method: Block Nested Loop Join. The experiment results show that our approaches have much better performance than that of Block Nested Loop Join and also have good scalability. To the best of our knowledge, this is the first work to try to deal with set similarity join on massive probabilistic data problem using MapReduce paradigm, and the approaches proposed in this paper provide a new way to process the massive probabilistic data.
Tài liệu tham khảo
Afrati, F.N., Sarma, A.D., Menestrina, D., Parameswaran, A.G., Ullman, J.D.: Fuzzy joins using MapReduce. In: ICDE’12, pp. 498–509 (2012)
Afrati, F.N., Ullman, J.D.: Optimizing joins in a map-reduce environment. In: EDBT’10, pp. 99–110 (2010)
Arasu, A., Ganti, V., Kaushik, R.: Efficient exact set-similarity joins. In: VLDB’06, pp. 918–929 (2006)
Arasu, A., Ganti, V., Kaushik, R.: Efficient exact set-similarity joins. In: VLDB’06, pp. 918–929 (2006)
Baraglia, R., Morales, G.D.F., Lucchese, C.: Document similarity self-join with MapReduce. In: ICDM’10, pp. 731–736 (2010)
Bayardo, R.J., Ma, Y., Srikant, R.: Scaling up all pairs similarity search. In: WWW’07, pp. 131–140 (2007)
Broder, Glassman, S.C., Manasse, M.S., Zweig, G.: Syntactic clustering of the web. Comput. Netw. (1997). doi:10.1016/S0169-7552(97)00031-7
Chaudhuri, S., Ganti, V., Kaushik, R.: A primitive operator for similarity joins in data cleaning. In: ICDE’06 (2006)
Cheng, R., Singh, S., Prabhakar, S., Shah, R., Vitter, J.S., Xia, Y.: Efficient join processing over uncertain data. In: CIKM’06, pp. 738–747 (2006)
Dean, J., Ghemawat, S.: Mapreduce: Simplified data processing on large clusters. In: OSDI’04, pp. 137–150 (2004)
Dong, X.L., Halevy, A.Y., Yu, C.: Data integration with uncertainty. VLDB J. (2009) doi:10.1007/s00778-008-0119-9
Elsayed, T., Lin, J.J., Oard, D.W.: Pairwise document similarity in large collections with MapReduce. In: ACL (Short Papers)’08, pp. 265–268 (2008)
Ghemawat, S., Gobioff, H., Leung, S.T.: The Google file system. In: SOSP’03, pp. 29–43 (2003)
Henzinger, M.R.: Finding near-duplicate web pages: a large-scale evaluation of algorithms. In: SIGIR’06, pp. 284–291 (2006)
Jestes, J., Li, F., Yan, Z., Yi, K.: Probabilistic string similarity joins. In: SIGMOD Conference’10, pp. 327–338 (2010)
Kim, Y., Shim, K.: Parallel top-k similarity join algorithms using MapReduce. In: ICDE, pp. 510–521 (2012)
Kimura, H., Madden, S., Zdonik, S.B.: Upi: a primary index for uncertain databases. In: PVLDB, pp. 630–637 (2010)
Kriegel, H.P., Kunath, P., Pfeifle, M., Renz, M.: Probabilistic similarity join on uncertain data. In: DASFAA’06, pp. 295–309 (2006)
Lian, X., Chen, L.: Set similarity join on probabilistic data. In: PVLDB, pp. 650–659 (2010)
Luo, W., Tan, H., Mao, H., Ni, L.: Efficient similarity joins on massive high-dimensional datasets using MapReduce. In: MDM’12, p. TBA (2012)
Okcan, A., Riedewald, M.: Processing theta-joins using MapReduce. In: SIGMOD Conference’11, pp. 949–960 (2011)
Vernica, R., Carey, M.J., Li, C.: Efficient parallel set-similarity joins using MapReduce. In: SIGMOD Conference’10, pp. 495–506 (2010)
Xiao, C., Wang, W., Lin, X., Yu, J.X., Wang, G.: Efficient similarity joins for near-duplicate detection. ACM Trans. Database Syst. (2011). doi:10.1145/2000824.2000825
Yang, H.-c., Dasdan, A., Hsiao, R.-L., Stott Parker, D.: Map-reduce-merge: simplified relational data processing on large clusters. In: SIGMOD Conference’07, pp. 1029–1040 (2007)