An example-based mapping method for text categorization and retrieval

ACM Transactions on Information Systems, Vol. 12, No. 3, pp. 252-277, 1994
Yiming Yang, Christopher G. Chute
Mayo Clinic

Abstract

A unified model for text categorization and text retrieval is introduced. We use a training set of manually categorized documents to learn word-category associations, and use these associations to predict the categories of arbitrary documents. Similarly, we use a training set of queries and their related documents to obtain empirical associations between query words and the indexing terms of documents, and use these associations to predict the related documents of arbitrary queries. A Linear Least Squares Fit (LLSF) technique is employed to estimate the likelihood of these associations. Document collections from the MEDLINE database and Mayo patient records are used to study the effectiveness of our approach and how much that effectiveness depends on the choice of training data, indexing language, word-weighting scheme, and morphological canonicalization. Alternative methods are also tested on these data collections for comparison. The results show that the LLSF approach makes effective use of the relevance information contained in human categorization and retrieval decisions, and achieves a semantic mapping of free texts to their representations in an indexing language. Such a semantic mapping leads to a significant improvement in categorization and retrieval compared to the alternative approaches.
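The LLSF idea described above can be illustrated with a minimal sketch. It assumes the common least-squares formulation in which training documents are the columns of a word-by-document matrix `A`, their manually assigned categories are the columns of a category-by-document matrix `B`, and the learned mapping `W` minimizes the Frobenius norm of `W A - B`. The toy vocabulary, categories, matrices, and function names are hypothetical, and NumPy's `lstsq` stands in for whatever solver, word weighting, and preprocessing the authors actually used, none of which the abstract specifies.

```python
import numpy as np

# Hypothetical toy vocabulary and category set, for illustration only.
vocab = ["heart", "attack", "lung", "cancer"]
categories = ["cardiology", "oncology"]

# A: word-by-document matrix, one column per training document.
A = np.array([
    [1, 1, 0, 0],  # heart
    [1, 0, 0, 0],  # attack
    [0, 0, 1, 1],  # lung
    [0, 0, 1, 0],  # cancer
], dtype=float)

# B: category-by-document matrix, the human-assigned categories
# of the same training documents (column j of B labels column j of A).
B = np.array([
    [1, 1, 0, 0],  # cardiology
    [0, 0, 1, 1],  # oncology
], dtype=float)


def fit_llsf(A, B):
    """Return W minimizing ||W A - B||_F.

    The problem is solved in transposed form, A^T W^T = B^T,
    because np.linalg.lstsq expects the unknown on the right.
    """
    W_T, *_ = np.linalg.lstsq(A.T, B.T, rcond=None)
    return W_T.T


def rank_categories(W, doc_vec):
    """Score every category for one document vector, highest score first."""
    scores = W @ doc_vec
    order = np.argsort(-scores)
    return [(categories[i], float(scores[i])) for i in order]


W = fit_llsf(A, B)

# An unseen text containing the words "heart" and "attack".
new_doc = np.array([1, 1, 0, 0], dtype=float)
ranking = rank_categories(W, new_doc)
print(ranking)
```

The same sketch covers the retrieval case if the columns of `A` are training queries and the columns of `B` are the indexing-term vectors of their related documents; the mapped query `W @ query_vec` would then be compared against document vectors rather than read off as category scores.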
