Automatic thesaurus for enhanced Chinese text retrieval

Emerald - 2000

Schubert Foo¹, Siu Cheung Hui², Hong Koon Lim³, Li Hui⁴

¹Schubert Foo is the Head and Associate Professor of the Division of Information Studies at Nanyang Technological University, Singapore.

²Siu Cheung Hui is an Associate Professor in the Division of Software Systems at the Nanyang Technological University, Singapore.

³Hong Koon Lim is a computer engineering graduate of the Nanyang Technological University, Singapore.

⁴Li Hui is a library and information science graduate of Peking University.

Abstract

Asian languages such as Japanese, Korean and, in particular, Chinese are beginning to gain popularity in the information retrieval (IR) domain. The quality of an IR system has traditionally been judged by its retrieval effectiveness, which in turn is commonly measured by recall and precision. This paper proposes and describes a process for automatically generating a Chinese thesaurus that supplies terms related to a user's query in order to enhance retrieval effectiveness. In the absence of existing automatic Chinese thesauri, techniques used in English thesaurus generation have been evaluated and adapted to generate a Chinese equivalent. The thesaurus is generated by computing co-occurrence values between domain-specific terms found in a document collection. These co-occurrence values are in turn derived from the term frequencies and document frequencies of the terms. A set of experiments was subsequently carried out on a document test set to evaluate the applicability of the thesaurus. The results of these experiments confirmed that such an automatically generated thesaurus is able to improve the retrieval effectiveness of a Chinese IR system.
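The abstract does not give the exact co-occurrence formula, so the following is only a minimal sketch of the general idea, under stated assumptions: documents are already segmented into terms, each term is weighted by term frequency times an inverse document frequency factor, and the relatedness of term a to term b is the summed co-occurrence weight divided by the summed weight of a. The function names `build_thesaurus` and `expand_query`, the `+ 1.0` smoothing in `idf`, and the toy documents are all hypothetical and chosen for illustration.

```python
import math
from collections import Counter, defaultdict
from itertools import permutations


def build_thesaurus(docs):
    """Build an asymmetric co-occurrence thesaurus (illustrative sketch only).

    docs: list of documents, each a list of already-segmented terms.
    Returns {term: {related_term: weight}}.
    """
    n_docs = len(docs)
    tfs = [Counter(doc) for doc in docs]  # term frequencies per document

    df = Counter()       # number of documents containing each term
    pair_df = Counter()  # number of documents containing both terms of a pair
    for tf in tfs:
        df.update(tf.keys())
        pair_df.update(permutations(tf, 2))

    def idf(count):
        # Assumed weighting: log inverse document frequency, shifted by 1
        # so that a term present in every document still has a nonzero weight.
        return math.log(n_docs / count) + 1.0

    term_weight = defaultdict(float)  # summed weight of each single term
    pair_weight = defaultdict(float)  # summed weight of each ordered pair
    for tf in tfs:
        for term, freq in tf.items():
            term_weight[term] += freq * idf(df[term])
        for a, b in permutations(tf, 2):
            pair_weight[(a, b)] += min(tf[a], tf[b]) * idf(pair_df[(a, b)])

    thesaurus = defaultdict(dict)
    for (a, b), weight in pair_weight.items():
        # Asymmetric: normalised by the weight of the source term a only.
        thesaurus[a][b] = weight / term_weight[a]
    return dict(thesaurus)


def expand_query(query_terms, thesaurus, top_k=2):
    """Append up to top_k most strongly related terms for each query term."""
    expanded = list(query_terms)
    for term in query_terms:
        related = sorted(thesaurus.get(term, {}).items(),
                         key=lambda item: (-item[1], item[0]))
        for candidate, _weight in related[:top_k]:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


# Toy, hand-segmented collection (hypothetical data).
docs = [
    ["信息", "检索", "系统"],
    ["信息", "检索", "词典"],
    ["中文", "分词", "词典"],
    ["中文", "分词"],
]
thesaurus = build_thesaurus(docs)
print(expand_query(["信息"], thesaurus, top_k=1))  # → ['信息', '检索']
```

In this toy collection "信息" and "检索" always appear together, so "检索" is the strongest suggestion for a query on "信息". The weights are asymmetric by construction: a rare term such as "系统" points strongly to "信息", while "信息" points to "系统" only weakly. A real system would also need a Chinese word segmenter and a threshold for discarding weak associations, neither of which is shown here.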
