Swoosh: a generic approach to entity resolution

The VLDB Journal, Volume 18, Issue 1, pp. 255–276 (2009)

Omar Benjelloun¹, Héctor García-Molina², David Menestrina³, Qi Su³, Steven Euijong Whang³, Jennifer Widom³

¹Google Inc., Mountain View, CA 94043, USA

²Computer Science Department, Stanford University, Stanford, CA 94305, USA

³Computer Science Department, Stanford University, Stanford, CA 94305, USA


References

Ananthakrishna, R., Chaudhuri, S., Ganti, V.: Eliminating fuzzy duplicates in data warehouses. In: Proceedings of VLDB, pp. 586–597 (2002)

Arasu, A., Ganti, V., Kaushik, R.: Efficient exact set-similarity joins. In: VLDB, pp. 918–929 (2006)

Bansal, N., Blum, A., Chawla, S.: Correlation clustering. In: FOCS, p. 238 (2002)

Baxter, R., Christen, P., Churches, T.: A comparison of fast blocking methods for record linkage. In: Proceedings of ACM SIGKDD’03 Workshop on Data Cleaning, Record Linkage, and Object Consolidation (2003). http://citeseer.ist.psu.edu/article/baxter03comparison.html

Bekkerman, R., McCallum, A.: Disambiguating web appearances of people in a social network. In: WWW, pp. 463–470 (2005)

Benjelloun, O., Garcia-Molina, H., Jonas, J., Menestrina, D., Whang, S., Su, Q., Widom, J.: Swoosh: a generic approach to entity resolution. Technical Report, Stanford University (2006). http://dbpubs.stanford.edu/pub/2005-5

Benjelloun, O., Garcia-Molina, H., Kawai, H., Larson, T.E., Menestrina, D., Thavisomboon, S.: D-Swoosh: a family of algorithms for generic, distributed entity resolution. In: ICDCS (2007)

Bhattacharya, I., Getoor, L.: Iterative record linkage for cleaning and integration. In: Proceedings of SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery (2004)

Bhattacharya, I., Getoor, L.: A latent Dirichlet model for unsupervised entity resolution. In: Sixth SIAM Conference on Data Mining (2006)

Blume, M.: Automatic entity disambiguation: benefits to NER, relation extraction, link analysis, and inference. In: International Conference on Intelligence Analysis (2005). https://analysis.mitre.org/

Chaudhuri, S., Ganjam, K., Ganti, V., Motwani, R.: Robust and efficient fuzzy match for online data cleaning. In: Proceedings of ACM SIGMOD, pp. 313–324 (2003)

Chaudhuri, S., Ganti, V., Motwani, R.: Robust identification of fuzzy duplicates. In: Proceedings of ICDE, Tokyo, Japan (2005)

Cohen, W.: Data integration using similarity joins and a word-based information representation language. ACM Trans. Inf. Syst. 18, 288–321 (2000)

Dong, X., Halevy, A.Y., Madhavan, J.: Reference reconciliation in complex information spaces. In: Proceedings of ACM SIGMOD (2005)

Fellegi, I.P., Sunter, A.B.: A theory for record linkage. J. Am. Stat. Assoc. 64(328), 1183–1210 (1969)

Galhardas, H., Florescu, D., Shasha, D., Simon, E., Saita, C.A.: Declarative data cleaning: Language, model, and algorithms. In: Proceedings of VLDB, pp. 371–380 (2001)

Gravano, L., Ipeirotis, P.G., Jagadish, H.V., Koudas, N., Muthukrishnan, S., Srivastava, D.: Approximate string joins in a database (almost) for free. In: VLDB, pp. 491–500 (2001)

Gu, L., Baxter, R., Vickers, D., Rainsford, C.: Record linkage: current practice and future directions. Technical Report 03/83, CSIRO Mathematical and Information Sciences (2003)

Hernández, M.A., Stolfo, S.J.: The merge/purge problem for large databases. In: Proceedings of ACM SIGMOD, pp. 127–138 (1995)

Hernández, M.A., Stolfo, S.J.: Real-world data is dirty: data cleansing and the merge/purge problem. Data Min. Knowl. Discov. 2(1), 9–37 (1998)

IBM: DB2 Entity Analytic Solutions. http://www-306.ibm.com/software/data/db2/eas/

Jaro, M.A.: Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida. J. Am. Stat. Assoc. 84(406), 414–420 (1989)

Jin, L., Li, C., Mehrotra, S.: Efficient record linkage in large data sets. In: Proceedings of International Conference on Database Systems for Advanced Applications, p. 137 (2003)

Kalashnikov, D.V., Mehrotra, S., Chen, Z.: Exploiting relationships for domain-independent data cleaning. In: Proceedings of the SIAM International Conference on Data Mining, Newport Beach, CA (2005)

McCallum, A.K., Nigam, K., Ungar, L.: Efficient clustering of high-dimensional data sets with application to reference matching. In: Proceedings of KDD, pp. 169–178, Boston, MA (2000)

Menestrina, D., Benjelloun, O., Garcia-Molina, H.: Generic entity resolution with data confidences. In: CleanDB (2006)

Monge, A.E., Elkan, C.: An efficient domain-independent algorithm for detecting approximately duplicate database records. In: Proceedings of SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, pp. 23–29 (1997)

Motro, A., Anokhin, P.: Fusionplex: resolution of data inconsistencies in the integration of heterogeneous information sources. Inf. Fusion 7(2), 176–196 (2006)

Newcombe, H.B., Kennedy, J.M., Axford, S.J., James, A.P.: Automatic linkage of vital records. Science 130(3381), 954–959 (1959)

Parag, Domingos, P.: Multi-relational record linkage. In: Proceedings of the KDD-2004 Workshop on Multi-Relational Data Mining, pp. 31–48 (2004)

Sarawagi, S., Bhamidipaty, A.: Interactive deduplication using active learning. In: Proceedings of ACM SIGKDD, Edmonton, Alberta (2002)

Schallehn, E., Sattler, K.U., Saake, G.: Extensible and similarity-based grouping for data integration. In: ICDE, p. 277 (2002)

Singla, P., Domingos, P.: Object identification with attribute-mediated dependences. In: Proceedings of PKDD, pp. 297–308 (2005)

Smith, T.F., Waterman, M.S.: Identification of common molecular subsequences. J. Mol. Biol. 147, 195–197 (1981)

Tarjan, R.E.: Efficiency of a good but not linear set union algorithm. J. ACM 22(2), 215–225 (1975)

Tejada, S., Knoblock, C.A., Minton, S.: Learning object identification rules for information integration. Inf. Syst. J. 26(8), 635–656 (2001)

Verykios, V.S., Moustakides, G.V., Elfeky, M.G.: A Bayesian decision model for cost optimal record matching. VLDB J. 12(1), 28–40 (2003). http://www.cs.purdue.edu/homes/mgelfeky/Papers/vldbj12(1 ).pdf

Winkler, W.: Overview of record linkage and current research directions. Technical Report, Statistical Research Division, U.S. Bureau of the Census, Washington, DC (2006)

Winkler, W.E.: Using the EM algorithm for weight computation in the Fellegi–Sunter model of record linkage. In: American Statistical Association, Proceedings of the Section on Survey Research Methods, pp. 667–671 (1988)
