Using the Web as corpus for self-training text categorization

Springer Science and Business Media LLC - Volume 12 - Pages 400-415 - 2008

Rafael Guzmán-Cabrera¹,², Manuel Montes-y-Gómez³, Paolo Rosso², Luis Villaseñor-Pineda³

¹Facultad de Ingeniería Mecánica, Eléctrica y Electrónica, Universidad de Guanajuato, Guanajuato, Mexico

²Natural Language Engineering Lab., Polytechnic University of Valencia, Valencia, Spain

³Laboratorio de Tecnologías del Lenguaje, Instituto Nacional de Astrofísica, Óptica y Electrónica, Tonantzintla, Mexico

Abstract

Most current methods for automatic text categorization rely on supervised learning and therefore need a large number of training instances to build an accurate classifier. To tackle this problem, this paper proposes a new semi-supervised method for text categorization that automatically extracts unlabeled examples from the Web and applies an enriched self-training approach to construct the classifier. Although language independent, the method is most pertinent to scenarios where large labeled resources do not exist, as is the case for several application domains in non-English languages such as Spanish. The method was evaluated on three different tasks and in two different languages. The results demonstrate the applicability and usefulness of the proposed method.
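To make the idea of self-training concrete, the following is a minimal sketch of a generic self-training loop, not the enriched procedure the paper proposes. It assumes the Web snippets have already been downloaded and tokenized, uses a small multinomial naive Bayes model as a stand-in classifier, and promotes an unlabeled example to the training set only when the predicted class probability reaches a threshold. The names `train_nb`, `predict_proba`, `self_train`, and the parameters `threshold` and `max_iters` are illustrative assumptions, not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs):
    """Fit a multinomial naive Bayes model from (tokens, label) pairs."""
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_proba(model, tokens):
    """Return a {label: probability} dict for one tokenized document."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    log_scores = {}
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for tok in tokens:
            if tok in vocab:  # words never seen in training are ignored
                # Laplace (add-one) smoothing
                score += math.log(
                    (word_counts[label][tok] + 1) / (total_words + len(vocab))
                )
        log_scores[label] = score
    # Normalize in a numerically stable way
    top = max(log_scores.values())
    exp_scores = {lab: math.exp(s - top) for lab, s in log_scores.items()}
    norm = sum(exp_scores.values())
    return {lab: v / norm for lab, v in exp_scores.items()}


def self_train(labeled, unlabeled, threshold=0.75, max_iters=5):
    """Hypothetical self-training loop (assumed design, not the paper's).

    labeled:   list of (tokens, label) pairs
    unlabeled: list of token lists, e.g. snippets retrieved from the Web
    """
    labeled = list(labeled)
    pool = list(unlabeled)
    model = train_nb(labeled)
    for _ in range(max_iters):
        confident, remaining = [], []
        for tokens in pool:
            probs = predict_proba(model, tokens)
            label, p = max(probs.items(), key=lambda kv: kv[1])
            if p >= threshold:
                confident.append((tokens, label))
            else:
                remaining.append(tokens)
        if not confident:
            break  # nothing new to add, so further iterations change nothing
        labeled.extend(confident)
        pool = remaining
        model = train_nb(labeled)  # retrain on the enlarged training set
    return model, labeled
```

Calling `self_train(labeled, unlabeled)` returns the final model together with the enlarged training set. The confidence threshold controls the trade-off typical of this kind of loop: a low value adds more examples per iteration but risks reinforcing early labeling mistakes.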

