Classifying documents with link-based bibliometric measures

Springer Science and Business Media LLC - Volume 13 - Pages 315-345 - 2009
T. Couto (1), N. Ziviani (1), P. Calado (2), M. Cristo (3), M. Gonçalves (1), E. S. de Moura (4), W. Brandão (1)
(1) Department of Computer Science, Federal University of Minas Gerais, Belo Horizonte, Brazil
(2) IST/INESC-ID, Lisbon, Portugal
(3) FUCAPI-Analysis, Research and Tech. Innovation Center, Manaus, Brazil
(4) Department of Computer Science, Federal University of Amazonas, Manaus, Brazil

Abstract

Automatic document classification can be used to organize documents in a digital library, construct online directories, improve the precision of web search, or support the interaction between users and search engines. In this paper we explore how the linkage information inherent to different document collections can be used to enhance the effectiveness of classification algorithms. We experimented with three link-based bibliometric measures (co-citation, bibliographic coupling, and Amsler) on three different document collections: a digital library of computer science papers, a web directory, and an online encyclopedia. Results show that both hyperlink and citation information can be used to learn reliable and effective classifiers based on a kNN classifier. In one of the test collections, we obtained improvements of up to 69.8% in macro-averaged F1 over the traditional text-based kNN classifier, which serves as the baseline in our experiments. We also present alternative ways of combining bibliometric-based classifiers with text-based classifiers. Finally, we conducted studies to analyze the situations in which the bibliometric-based classifiers failed and show that in such cases it is hard to reach consensus regarding the correct classes, even for human judges.
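
To make the three measures concrete, the following is a minimal sketch, not the authors' implementation. It assumes a common Jaccard-style normalization (shared neighbors divided by the union of neighbors), which the abstract does not specify. The function names, the toy link graph, and the similarity-weighted kNN vote are all illustrative assumptions.

```python
from collections import defaultdict


def build_link_sets(edges):
    """Build parent and child sets from (citing, cited) pairs."""
    parents = defaultdict(set)   # documents that cite / link to d
    children = defaultdict(set)  # documents that d cites / links to
    for src, dst in edges:
        children[src].add(dst)
        parents[dst].add(src)
    return dict(parents), dict(children)


def jaccard(a, b):
    """Overlap of two sets, normalized by their union (assumed normalization)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cocitation(d1, d2, parents, children):
    # Two documents are similar if the same documents cite both.
    return jaccard(parents.get(d1, set()), parents.get(d2, set()))


def bib_coupling(d1, d2, parents, children):
    # Two documents are similar if they cite the same documents.
    return jaccard(children.get(d1, set()), children.get(d2, set()))


def amsler(d1, d2, parents, children):
    # Combines both views: compare the union of parents and children.
    n1 = parents.get(d1, set()) | children.get(d1, set())
    n2 = parents.get(d2, set()) | children.get(d2, set())
    return jaccard(n1, n2)


def knn_classify(doc, labels, similarity, parents, children, k=3):
    """Similarity-weighted vote among the k most similar labeled documents.

    Returns None when the document shares no links with any labeled one.
    """
    scored = [(similarity(doc, other, parents, children), other)
              for other in labels if other != doc]
    top = sorted((s for s in scored if s[0] > 0), reverse=True)[:k]
    votes = defaultdict(float)
    for score, other in top:
        votes[labels[other]] += score
    return max(votes, key=votes.get) if votes else None


# Hypothetical toy graph: "q" is the unlabeled document.
edges = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y"),
         ("c", "z"), ("c", "w"), ("q", "x"), ("q", "y"), ("q", "z"),
         ("p1", "q"), ("p1", "a"), ("p2", "a")]
labels = {"a": "databases", "b": "databases", "c": "retrieval"}
parents, children = build_link_sets(edges)
predicted = knn_classify("q", labels, bib_coupling, parents, children, k=2)
```

In this toy graph, "q" shares two of its three outgoing links with "a" and "b", so bibliographic coupling places it with their class. A document with no links in common with any labeled document gets no prediction from the link-based classifier alone, which is one motivation for combining it with a text-based classifier.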
