Set similarity join on massive probabilistic data using MapReduce

Springer Science and Business Media LLC - Tập 32 - Trang 447-464 - 2013

Youzhong Ma¹, Xiaofeng Meng¹

¹School of Information, Renmin University of China, Beijing, China

Tóm tắt

In this paper, we focus on set similarity join on massive probabilistic data using MapReduce, there is no effective approach that can process this problem efficiently. MapReduce is a popular paradigm that can process large volume data more efficiently, in this paper, we proposed two approaches using MapReduce to deal with this task: Hadoop Join by Map Side Pruning and Hadoop Join by Reduce Side Pruning. Hadoop Join by Map Side Pruning uses the sum of the existence probability to filter out the probabilistic sets directly at the Map task side which have no any chance to be similar with any other probabilistic set. Hadoop Join by Reduce Side Pruning uses probability sum based pruning principle and probability upper bound based pruning principle to reduce the candidate pairs at Reduce task side, it can save the comparison cost. Based on the above approaches, we proposed a hybrid solution that employs both Map-side and Reduce-side pruning methods. Finally we implemented the above approaches on Hadoop-0.20.2 and performed comprehensive experiments to their performance, we also test the speedup ratio compared with the naive method: Block Nested Loop Join. The experiment results show that our approaches have much better performance than that of Block Nested Loop Join and also have good scalability. To the best of our knowledge, this is the first work to try to deal with set similarity join on massive probabilistic data problem using MapReduce paradigm, and the approaches proposed in this paper provide a new way to process the massive probabilistic data.

Tài liệu tham khảo

Afrati, F.N., Sarma, A.D., Menestrina, D., Parameswaran, A.G., Ullman, J.D.: Fuzzy joins using MapReduce. In: ICDE’12, pp. 498–509 (2012) Afrati, F.N., Ullman, J.D.: Optimizing joins in a map-reduce environment. In: EDBT’10, pp. 99–110 (2010) Arasu, A., Ganti, V., Kaushik, R.: Efficient exact set-similarity joins. In: VLDB’06, pp. 918–929 (2006) Arasu, A., Ganti, V., Kaushik, R.: Efficient exact set-similarity joins. In: VLDB’06, pp. 918–929 (2006) Baraglia, R., Morales, G.D.F., Lucchese, C.: Document similarity self-join with MapReduce. In: ICDM’10, pp. 731–736 (2010) Bayardo, R.J., Ma, Y., Srikant, R.: Scaling up all pairs similarity search. In: WWW’07, pp. 131–140 (2007) Broder, Glassman, S.C., Manasse, M.S., Zweig, G.: Syntactic clustering of the web. Comput. Netw. (1997). doi:10.1016/S0169-7552(97)00031-7 Chaudhuri, S., Ganti, V., Kaushik, R.: A primitive operator for similarity joins in data cleaning. In: ICDE’06 (2006) Cheng, R., Singh, S., Prabhakar, S., Shah, R., Vitter, J.S., Xia, Y.: Efficient join processing over uncertain data. In: CIKM’06, pp. 738–747 (2006) Dean, J., Ghemawat, S.: Mapreduce: Simplified data processing on large clusters. In: OSDI’04, pp. 137–150 (2004) Dong, X.L., Halevy, A.Y., Yu, C.: Data integration with uncertainty. VLDB J. (2009) doi:10.1007/s00778-008-0119-9 Elsayed, T., Lin, J.J., Oard, D.W.: Pairwise document similarity in large collections with MapReduce. In: ACL (Short Papers)’08, pp. 265–268 (2008) Ghemawat, S., Gobioff, H., Leung, S.T.: The Google file system. In: SOSP’03, pp. 29–43 (2003) Henzinger, M.R.: Finding near-duplicate web pages: a large-scale evaluation of algorithms. In: SIGIR’06, pp. 284–291 (2006) Jestes, J., Li, F., Yan, Z., Yi, K.: Probabilistic string similarity joins. In: SIGMOD Conference’10, pp. 327–338 (2010) Kim, Y., Shim, K.: Parallel top-k similarity join algorithms using MapReduce. In: ICDE, pp. 510–521 (2012) Kimura, H., Madden, S., Zdonik, S.B.: Upi: a primary index for uncertain databases. In: PVLDB, pp. 630–637 (2010) Kriegel, H.P., Kunath, P., Pfeifle, M., Renz, M.: Probabilistic similarity join on uncertain data. In: DASFAA’06, pp. 295–309 (2006) Lian, X., Chen, L.: Set similarity join on probabilistic data. In: PVLDB, pp. 650–659 (2010) Luo, W., Tan, H., Mao, H., Ni, L.: Efficient similarity joins on massive high-dimensional datasets using MapReduce. In: MDM’12, p. TBA (2012) Okcan, A., Riedewald, M.: Processing theta-joins using MapReduce. In: SIGMOD Conference’11, pp. 949–960 (2011) Vernica, R., Carey, M.J., Li, C.: Efficient parallel set-similarity joins using MapReduce. In: SIGMOD Conference’10, pp. 495–506 (2010) Xiao, C., Wang, W., Lin, X., Yu, J.X., Wang, G.: Efficient similarity joins for near-duplicate detection. ACM Trans. Database Syst. (2011). doi:10.1145/2000824.2000825 Yang, H.-c., Dasdan, A., Hsiao, R.-L., Stott Parker, D.: Map-reduce-merge: simplified relational data processing on large clusters. In: SIGMOD Conference’07, pp. 1029–1040 (2007)

Scholar Hub - Công cụ hỗ trợ trích dẫn và phân tích khoa học Việt Nam

Về chúng tôi

Scholar Hub là công cụ hỗ trợ trích dẫn và phân tích các bài báo, công bố khoa học Việt Nam. Công cụ trợ giúp người nghiên cứu, tạp chí, đơn vị nghiên cứu tra cứu, phân tích và thống kê dữ liệu nghiên cứu khoa học tại Việt Nam và quốc tế.
ScholarHub KHÔNG đăng thông tin tổng hợp, KHÔNG đăng lại nội dung từ các trang báo chí Việt Nam hoặc trang thông tin điện tử khác tại Việt Nam.

Thông tin, cập nhật

Đăng ký Tạp chí tham gia vào Scholar Hub

Phản hồi ý kiến về Scholar Hub

Bài viết, nội dung cập nhật

Chủ đề khoa học

Website liên kết

Phần mềm kiểm tra trùng lặp Kiểm Tra Tài Liệu

Phần mềm xuất bản tạp chí điện tử VOJS

Công cụ kiểm tra chính tả và thể thức Viver

Nền tảng trắc nghiệm và đề thi đa lĩnh vực LetQA