Nội dung được dịch bởi AI, chỉ mang tính chất tham khảo

Nghiên cứu hệ thống về mối tương quan của các tham số trong phát hiện tài liệu trùng lặp quy mô lớn

Knowledge and Information Systems - Tập 14 - Trang 217-232 - 2007

Shaozhi Ye¹, Ji-Rong Wen², Wei-Ying Ma²

¹Department of Computer Science, University of California, Davis, USA

²Microsoft Research Asia, Beijing, P.R.China

Tóm tắt

Mặc dù đã có nhiều công trình nghiên cứu về phát hiện tài liệu trùng lặp (DDD) và các ứng dụng của nó, nhưng chúng tôi nhận thấy thiếu một nghiên cứu hệ thống về hiệu suất và khả năng mở rộng của các thuật toán DDD quy mô lớn. Vẫn chưa rõ cách mà các tham số khác nhau trong DDD tương quan với nhau, chẳng hạn như ngưỡng tương đồng, yêu cầu độ chính xác/nhớ, tỷ lệ mẫu và kích thước tài liệu. Bài báo này khám phá mối tương quan giữa một số tham số quan trọng nhất trong DDD và tác động của tỷ lệ mẫu là vấn đề được quan tâm nhất, vì nó ảnh hưởng đáng kể đến độ chính xác và khả năng mở rộng của các thuật toán DDD. Phân tích thực nghiệm được thực hiện trên một triệu tài liệu HTML từ bộ sưu tập TREC .GOV. Kết quả thí nghiệm cho thấy rằng ngay cả khi sử dụng cùng một tỷ lệ mẫu, độ chính xác của DDD thay đổi rất lớn trên các tài liệu có kích thước khác nhau. Dựa trên quan sát này, chúng tôi đề xuất một chiến lược sampling thích ứng cho DDD, nhằm giảm thiểu tỷ lệ mẫu với ràng buộc về yêu cầu độ chính xác nhất định. Chúng tôi tin rằng những hiểu biết từ phân tích của chúng tôi sẽ hữu ích cho việc định hướng công việc DDD quy mô lớn trong tương lai.

Từ khóa

#phát hiện tài liệu trùng lặp #thuật toán DDD #mối tương quan tham số #độ chính xác #tỷ lệ mẫu #khả năng mở rộng

Tài liệu tham khảo

Bellare M, Kohno T (2004) Hash function balance and its impact on birthday attacks. In: EUROCRYPT 2004: international conference on the theory and applications of cryptographic techniques, pp 401–418 Bharat K, Broder AZ (1999) Mirror, mirror on the Web: a study of host pairs with replicated content. In: Proceedings of the 8th international World Wide Web (WWW) conference, pp 501–512 Bharat K, Broder AZ, Dean J, Henzinger MR (2000) A comparison of techniques to find mirrored hosts on the WWW. J Am Soc Inf Sci (JASIS) 51(12):1114–1122 Brin S, Davis J, Garcia-Molina H (1995) Copy detection mechanisms for digital documents. In: Proceedings of the 1995 ACM international conference on management of data (SIGMOD), pp 398–409 Broder AZ, Glassman SC, Manasse MS, Zweig G (1997) Syntactic clustering of the Web. In: Proceedings of the sixth international World Wide Web (WWW) conference, pp 1157–1166 Cho J, Shivakumar N, Garcia-Molina H (2000) Finding replicated Web collections. In: Proceedings of the 2000 ACM international conference on management of data (SIGMOD), pp 355–366 Chowdhury A, Frieder O, Grossman D, McCabe MC (2002) Collection statistics for fast duplicate document detection. ACM Trans Inf Syst 20(2):171–191 Conrad JG, Guo XS, Schriber CP (2003) Online duplicate document detection: signature reliability in a dynamic retrieval environment. In: Proceedings of the 12th international conference on information and knowledge management (CIKM), pp 443–452 Cooper JW, Coden A, Brown EW (2002) Detecting similar documents using salient terms. In: Proceedings of the 11th ACM international conference on information and knowledge management (CIKM), pp 245–251 Crovella ME, Taqqu MS, Bestavros A (1998) Heavy-tailed probability distributions in the World Wide Web. In: Adler R, Feldman R, Taqqu M (eds.) A practical guide to heavy tails: statistical techniques and applications. Birkhauser, Boston, pp 3–25 Dean J, Henzinger MR (1999) Finding related pages in the World Wide Web. In: Proceeding of the 8th international World Wide Web conference (WWW), pp 1467–1479 Fetterly D, Manasse M, Najork M (2003) On the evolution of clusters of near-duplicate Web pages. In: Proceedings of the 1st Latin American Web Congress (LA-Web), pp 37–45 Fetterly D, Manasse M, Najork M (2004) Spam, damn spam, and statistics: using statistical analysis to locate spam Web pages. In: Proceedings of the 7th international workshop on the Web and databases (WebDB), pp 1–6 Fetterly D, Manasse M, Najork M, Wiener J (2003) A large-scale study of the evolution of Web pages. In: Proceedings of the 12th international World Wide Web (WWW) conference, pp 669–678 Gyongyi Z, Garcia-Molina H (2005) Web spam taxonomy, Technical report, Stanford University Heintze N (1996) Scalable document fingerprinting. In: Proceedings of the 2nd USENIX electronic commerce workshop, pp 191–200 Li Z, Ng WK, Sun A (2005) Web data extraction based on structural similarity. Knowl Inf Syst 8(4):438–461 Mukherjea1 S (2004) Discovering and analyzing World Wide Web collections. Knowl Inf Syst 6(2):230–241 Rabin M (1981) Fingerprinting by random polynomials, Technical report tr-15-81, Center for Research in Computing Technology, Harvard University Shivakumar N, Garcia-Molina H (1998) Finding near-replicas of documents and servers on the Web. In: Proceedings of the 1st international workshop on World Wide Web and Databases (WebDB), pp 204–212 Soboroff I (2002) Do TREC Web collections look like the Web? SIGIR Forum 36(2):23–31 Wang Y, Kitsuregawa M (2002) Evaluating contents-link coupled Web page clustering for Web search results. In: Proceedings of the 11th ACM international conference on information and knowledge management (CIKM), pp 499–506 Ye S, Song R, Wen J-R, Ma W-Y (2004) A query-dependent duplicate detection approach for large scale search engines. In: Proceedings of the 6th Asia-Pacific Web conference (APWeb), pp 48–58 Yi L, Liu B, Li X (2003) Eliminating noisy information in Web pages for data mining. In: Proceedings of the 9th ACM international conference on knowledge discovery and data mining (SIGKDD), pp 296–305 Zamir O, Etzioni O (1998) Web document clustering: a feasibility demonstration. In: Proceedings of the 21st annual international ACM conference on research and development in information retrieval (SIGIR), pp 46–54 Zhong S, Ghosh J (2005) Generative model-based document clustering: a comparative study. Knowl Inf Syst 8(3):374–384

Scholar Hub - Công cụ hỗ trợ trích dẫn và phân tích khoa học Việt Nam

Scholar Hub là công cụ hỗ trợ trích dẫn và phân tích ảnh hưởng của các bài báo, công bố khoa học Việt Nam và Quốc tế.
ScholarHub KHÔNG đăng thông tin tổng hợp, KHÔNG đăng lại nội dung từ các trang báo chí Việt Nam hoặc trang thông tin điện tử khác tại Việt Nam.

Thông tin, cập nhật

Đăng ký Tạp chí tham gia Scholar Hub

Phản hồi ý kiến về Scholar Hub

Bài viết, nội dung cập nhật

Chủ đề khoa học

Website liên kết

Hệ thống CSDL Khoa học & Công nghệ SciBase

Phần mềm kiểm tra trùng lặp Kiểm Tra Tài Liệu

Phần mềm xuất bản tạp chí điện tử VOJS

Hệ thống hội thảo khoa học Việt Nam

Nền tảng trắc nghiệm và đề thi đa lĩnh vực LetQA

Thông tin liên hệ & hỗ trợ

Đơn vị chủ quản, phát triển và vận hành: Công ty Cổ phần Metis

Địa chỉ liên hệ: 26A Lê Đức Thọ, Phường Từ Liêm, Thành phố Hà Nội

Số giấy chứng nhận ĐKKD: 0109293202 cấp ngày 03/08/2020 tại Sở Kế hoạch và Đầu tư thành phố Hà Nội

Người quản lý và chịu trách nhiệm nội dung: Nguyễn Ngọc Sơn

Hotline: 0566.685.688

Email: [email protected]