Document clustering of scientific texts using citation contexts

Springer Science and Business Media LLC - Volume 13, Issue 2 - Pages 101-131 - 2010
Bader Aljaber1, Nicola Stokes2, James Bailey3, Jian Pei4
1Department of Computer Science and Software Engineering, The University of Melbourne, Melbourne, Australia
2School of Computer Science and Informatics, University College Dublin, Dublin, Ireland
3NICTA Victoria Laboratory, Department of Computer Science and Software Engineering, The University of Melbourne, Melbourne, Australia
4School of Computing Science, Simon Fraser University, Burnaby, Canada

Abstract

Keywords


References

Aas, K., & Eikvil, L. (1999). Text categorisation: A survey. Technical Report NR 941, Norwegian Computing Center, June.

Angelova, R., & Siersdorfer, S. (2006). A neighborhood-based approach for clustering of linked document collections. In Proceedings of the 15th ACM conference on Information and knowledge management (pp. 778–779).

Bekkerman, R., El-Yaniv, R., Tishby, N., & Winter, Y. (2003). Distributional word clusters vs. words for text categorization. Journal of Machine Learning Research, 3, 1183–1208.

Bergmark, D. (2000). Automatic extraction of reference linking information from online documents. Technical Report CSTR 2000-1821, Cornell Digital Library Research Group.

Bergmark, D., Phempoonpanich, P., & Zhao, S. (2001). Scraping the ACM digital library. SIGIR Forum, 35(2), 1–7.

Bradshaw, S. (2001). Document indexing vocabularies: Reference vs content. Northwestern University (Technical Report, NWU-CS-01-7).

Bradshaw, S. (2002). Reference directed indexing: Indexing scientific literature in the context of its use. Ph.D. dissertation, Northwestern University (Technical Report, NWU-CS-02-7).

Bradshaw, S. (2003). Reference directed indexing: Redeeming relevance for subject search in citation indexes. In Proceedings of the 7th European conference on research and advanced technology for digital libraries (pp. 499–510).

Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual web search engine. In Proceedings of the seventh international conference on world wide web (pp. 107–117).

Chik, F., Luk, R., & Chung, K. (2005). Text categorization based on subtopic clusters. Natural Language Processing and Information Systems, 3513, 203–214.

Councill, I. G., Giles, C. L., & Kan, M. Y. (2008). ParsCit: An open-source CRF reference string parsing package. In Proceedings of language resources and evaluation conference (LREC 08).

Dash, M., & Liu, H. (2000). Feature selection for clustering. In Proceedings of the Pacific-Asia conference on knowledge discovery and data mining (PAKDD) (pp. 110–121).

Dhillon, I., Kogan, J., & Nicholas, M. (2004). Feature selection and document clustering. In Survey of text mining (pp. 73–100). New York: Springer.

Dhillon, I., Guan, Y., & Kulis, B. (2007). Weighted graph cuts without eigenvectors: A multilevel approach. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 29(11), 1944–1957.

Dumais, S., Platt, J., Heckerman, D., & Sahami, M. (1998). Inductive learning algorithms and representations for text categorization. In Proceedings of the ACM conference on information and knowledge management (pp. 148–155).

Elkiss, A., Shen, S., Fader, A., Erkan, G., States, D. J., & Radev, D. R. (2008). Blind men and elephants: What do citation summaries tell us about a research article? JASIST, 59(1), 51–62.

Furnas, G., Landauer, T., Gomez, L., & Dumais, S. (1987). The vocabulary problem in human-system communication. Communications of the ACM, 30(11), 964–971.

Gabrilovich, E., & Markovitch, S. (2006). Overcoming the brittleness bottleneck using Wikipedia: Enhancing text categorization with encyclopedic knowledge. In Proceedings of the twenty-first AAAI conference on artificial intelligence (pp. 1301–1306).

Garfield, E. (1964). Science citation index, a new dimension in indexing. Science, 144(3619), 649–654.

Giles, C., Bollacker, K., & Lawrence, S. (1998). Citeseer: An automatic citation indexing system. In Proceedings of the third ACM conference on digital libraries, June 1998 (pp. 89–98).

Glover, E., Tsioutsiouliklis, K., Lawrence, S., Pennock, D., & Flake, G. (2002). Using web structure for classifying and describing web pages. In Proceedings of the world wide web conference (pp. 562–569).

Hartigan, J., & Wong, M. (1979). A k-means clustering algorithm. Applied Statistics, 28, 100–108.

Haveliwala, T., Gionis, A., Klein, D., & Indyk, P. (2002). Evaluating strategies for similarity search on the web. In Proceedings of the world wide web conference (pp. 432–442).

Hunter, L., & Cohen, K. (2006). Biomedical language processing: What’s beyond PubMed? Molecular Cell, 21(5), 589–594.

Kao, H.-Y., Chen, M.-S., Lin, S.-H., & Ho, J.-M. (2002). Entropy-based link analysis for mining web informative structures. In Proceedings of the ACM conference on information and knowledge management (pp. 574–581).

Kleinberg, J. (1999). Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5), 604–632.

Krovetz, R., & Croft, W. (1992). Lexical ambiguity and information retrieval. ACM Transactions on Information Systems, 10(2), 115–141.

Kull, M., & Vilo, J. (2008). Fast approximate hierarchical clustering using similarity heuristics. BioData Mining, 9, 1.

Lawrence, S., Giles, C., & Bollacker, K. (1999). Digital libraries and autonomous citation indexing. IEEE Computer, 32(6), 67–71.

Liu, X., & Croft, W. B. (2004). Cluster-based retrieval using language models. In SIGIR 2004: proceedings of the 27th annual international ACM SIGIR conference on research and development in information retrieval, Sheffield, UK (pp. 186–193).

Liu, T., Liu, S., Chen, Z., & Ma, W. (2003). An evaluation on feature selection for text clustering. In Proceedings of the twentieth international conference on machine learning (ICML), Washington, DC (pp. 488–495).

Liu, J., Paulsen, S., Sun, X., Wang, W., Nobel, A., & Prins, J. (2006). Mining approximate frequent itemsets in the presence of noise: Algorithm and analysis. In Proceedings of the 6th SIAM international conference on data mining (SDM) (pp. 405–416).

Madeira, S., & Oliveira, A. (2004). Biclustering algorithms for biological data analysis: A survey. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 1(1), 24–45.

Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval. Cambridge: Cambridge University Press.

Mercer, R., & Marco, C. D. (2004). A design methodology for a biomedical literature indexing tool using the rhetoric of science. In Proceedings of the bioLink workshop in conjunction with human language technology conference/North American chapter of the association for computational linguistics annual meeting (HLT/NAACL) (pp. 77–84).

Moravcsik, M., & Murugesan, P. (1975). Some results on the function and quality of citations. Social Studies of Science, 5, 86–92.

Nakov, P., Schwartz, A., & Hearst, M. (2004). Citances: Citation sentences for semantic analysis of bioscience text. In Proceedings of the SIGIR’04 workshop on search and discovery in bioinformatics (pp. 81–88).

Nanba, H., & Okumura, M. (2005). Automatic detection of survey articles. In A. Rauber, S. Christodoulakis, & A. M. Tjoa (Eds.), Research and advanced technology for digital libraries, 9th European conference, ECDL, Proceedings, September 18–23, 2005. Lecture Notes in Computer Science (Vol. 3652, pp. 391–401). Vienna, Austria: Springer.

Nanba, H., Kando, N., & Okumura, M. (1999). Towards multi paper summarization using reference information. In Proceedings of the 16th international joint conferences on artificial intelligence (IJCAI-99) (pp. 926–931).

Nanba, H., Kando, N., & Okumura, M. (2000). Classification of research papers using citation links and citation types: Towards automatic review article generation. In Proceedings of the American Society for Information Science (ASIS)/the 11th SIG classification research workshop, classification for user support and learning, 2000, Chicago, USA (pp. 117–134).

Nanba, H., Abekawa, T., Okumura, M., & Saito, S. (2004). Bilingual PRESRI: Integration of multiple research paper databases. In Proceedings of RIAO (pp. 195–211).

Porter, M. (1980). An algorithm for suffix stripping. Program, 14(3), 130–137.

Powley, B., & Dale, R. (2007). Evidence-based information extraction for high-accuracy citation extraction and author name recognition. In Proceedings of the 8th RIAO international conference on large-scale semantic access to content.

Ritchie, A., Teufel, S., & Robertson, S. (2006). How to find better index terms through citations. In Proceedings of the workshop on how can computational linguistics improve information retrieval?, Sydney (pp. 25–32).

Ritchie, A., Robertson, S., & Teufel, S. (2008a). Comparing citation contexts for information retrieval. In J. G. Shanahan, S. Amer-Yahia, I. Manolescu, Y. Zhang, D. A. Evans, A. Kolcz, K.-S. Choi, & A. Chowdhury (Eds.), Proceedings of the 17th ACM conference on information and knowledge management, CIKM 2008, October 26–30, 2008 (pp. 213–222). Napa Valley, CA, USA: ACM.

Ritchie, A., Teufel, S., & Robertson, S. (2008b). Using terms from citations for information retrieval: Some first results. In Proceedings of the 30th European conference on information retrieval (ECIR) (pp. 211–221).

Robertson, S., Zaragoza, H., & Taylor, M. (2004). Simple BM25 extension to multiple weighted fields. In Proceedings of the thirteenth ACM international conference on information and knowledge management (CIKM), 2004 (pp. 42–49). New York, NY, USA: ACM.

Salton, G. (1971). The SMART retrieval system: Experiments in automatic document processing. Upper Saddle River, NJ: Prentice-Hall, Inc.

Siddharthan, A., & Teufel, S. (2007). Whose idea was this and why does it matter? Attributing scientific work to citations. In Proceedings of the annual conference of the North American chapter of the association for computational linguistics (NAACL-HLT) (pp. 316–323).

Slonim, N., & Tishby, N. (2000). Document clustering using word clusters via the information bottleneck method. In Proceedings of the international ACM SIGIR conference on research and development in information retrieval (pp. 208–215).

Small, H., & Sweeney, E. (1985). Clustering the science citation index using co-citations. Scientometrics, 7(3-6), 391–409.

Tang, B., Shepherd, M., Milios, E., & Heywood, M. (2004). Comparing and combining dimension reduction techniques for efficient text clustering. In Proceedings of the workshop on feature selection for data mining, SIAM international conference on data mining (SDM) (pp. 17–26).

Teufel, S., & Moens, M. (2002). Summarizing scientific articles: Experiments with relevance and rhetorical status. Computational Linguistics, 28(4), 409–445.

Teufel, S., Siddharthan, A., & Tidhar, D. (2006). Automatic classification of citation function. In Proceedings of EMNLP-06.

Voorhees, E. (1986). Implementing agglomerative hierarchic clustering algorithms for use in document retrieval. Information Processing and Management, 22(6), 465–476.

Wang, Y., & Kitsuregawa, M. (2002). Evaluating contents-link coupled web page clustering for web search results. In Proceedings of the ACM conference on information and knowledge management (CIKM) (pp. 499–506).

Wang, Y., & Kitsuregawa, M. (2004). Enhancing contents-link coupled web page clustering and its evaluation. In Proceedings of data engineering workshop (DEWS2004).

White, H. (2004). Citation analysis and discourse analysis revisited. Applied Linguistics, 25(1), 89–116.

Wyse, N., Dubes, R., & Jain, A. (1980). A critical evaluation of intrinsic dimensionality algorithms. In E. S. Gelsema & L. N. Kanal (Eds.), Pattern recognition in practice (pp. 415–425). North-Holland Inc.

Yang, Y., & Pedersen, J. (1997). A comparative study on feature selection in text categorization. In Proceedings of the international conference on machine learning (pp. 412–420).