Automatic thesaurus for enhanced Chinese text retrieval

Emerald - 2000
Schubert Foo1, Siu Cheung Hui2, Hong Koon Lim3, Li Hui4
1Schubert Foo is the Head and Associate Professor of the Division of Information Studies at Nanyang Technological University, Singapore.
2Siu Cheung Hui is an Associate Professor in the Division of Software Systems at the Nanyang Technological University, Singapore.
3Hong Koon Lim is a Computer Engineering Graduate from the Nanyang Technological University, Singapore.
4Li Hui is a Library and Information Science Graduate from Peking University.

Abstract

Asian languages such as Japanese, Korean and, in particular, Chinese are gaining popularity in the information retrieval (IR) domain. The quality of an IR system has traditionally been judged by its retrieval effectiveness, which in turn is commonly measured by recall and precision. This paper proposes and describes a process for automatically generating a Chinese thesaurus that supplies terms related to a user’s query in order to enhance retrieval effectiveness. In the absence of existing automatic Chinese thesauri, techniques used in English thesaurus generation have been evaluated and adapted to generate a Chinese equivalent. The thesaurus is generated by computing co‐occurrence values between domain‐specific terms found in a document collection. These co‐occurrence values are in turn derived from the term frequencies and document frequencies of the terms. A set of experiments was then carried out on a document test set to evaluate the applicability of the thesaurus. The results of these experiments confirmed that such an automatically generated thesaurus is able to improve the retrieval effectiveness of a Chinese IR system.
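
The abstract does not give the weighting formula, so the following Python sketch is only an illustration of how co‐occurrence values might be derived from term and document frequencies. It assumes the documents have already been segmented into terms, and it uses an asymmetric weighting loosely in the spirit of the English thesaurus work the paper adapts. The function name `build_thesaurus`, the `top_k` parameter, the logarithmic weighting and the toy collection are all assumptions made for illustration, not the authors' method.

```python
import math
from collections import Counter, defaultdict
from itertools import permutations


def build_thesaurus(docs, top_k=3):
    """Map each term to its top_k related terms (hypothetical helper).

    docs: list of documents, each a list of already-segmented terms.
    The weighting below is an assumed, illustrative choice.
    """
    n_docs = len(docs)
    tfs = [Counter(doc) for doc in docs]             # term frequency per document
    df = Counter(term for tf in tfs for term in tf)  # document frequency per term

    # Number of documents in which each ordered pair of distinct terms co-occurs.
    pair_df = Counter()
    for tf in tfs:
        pair_df.update(permutations(tf, 2))

    term_weight = defaultdict(float)  # summed weight of a term over all documents
    pair_weight = defaultdict(float)  # summed weight of a co-occurring pair
    for tf in tfs:
        for a in tf:
            term_weight[a] += tf[a] * math.log(1 + n_docs / df[a])
        for a, b in permutations(tf, 2):
            joint_tf = min(tf[a], tf[b])
            pair_weight[(a, b)] += joint_tf * math.log(1 + n_docs / pair_df[(a, b)])

    # Asymmetric co-occurrence value: how strongly term a suggests term b.
    related = defaultdict(list)
    for (a, b), weight in pair_weight.items():
        related[a].append((b, weight / term_weight[a]))

    return {
        a: sorted(candidates, key=lambda item: (-item[1], item[0]))[:top_k]
        for a, candidates in related.items()
    }


if __name__ == "__main__":
    # Toy collection of pre-segmented documents (made up for this sketch).
    docs = [
        ["检索", "系统", "检索"],
        ["检索", "词典"],
        ["系统", "检索"],
        ["词典", "分词"],
    ]
    for term, neighbours in build_thesaurus(docs).items():
        print(term, neighbours)
```

In a retrieval system, the related terms returned for each query term could be appended to the user's query before searching, which is the query expansion role the abstract describes for the thesaurus.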
