Using the Web as corpus for self-training text categorization
Tóm tắt
Most current methods for automatic text categorization are based on supervised learning techniques and, therefore, they face the problem of requiring a great number of training instances to construct an accurate classifier. In order to tackle this problem, this paper proposes a new semi-supervised method for text categorization, which considers the automatic extraction of unlabeled examples from the Web and the application of an enriched self-training approach for the construction of the classifier. This method, even though language independent, is more pertinent for scenarios where large sets of labeled resources do not exist. That, for instance, could be the case of several application domains in different non-English languages such as Spanish. The experimental evaluation of the method was carried out in three different tasks and in two different languages. The achieved results demonstrate the applicability and usefulness of the proposed method.
Tài liệu tham khảo
Aas, K., & Eikvil, L. (1999). Text categorization: A survey. Tech. Rep. 941. Norwegian Computing Center.
Argamon, S., & Levitan, S. (2005). Measuring the usefulness of function words for authorship attribution. In Proceedings of ACH/ALLC Conference 2005.
Bekkerman, R., & Allan, J. (2004). Using bigrams in text categorization. Tech. Rep. IR-408. Center of Intelligent Information Retrieval, UMass Amherst.
Chaski, C. (2005). Who’s at the keyboard: Authorship attribution in digital evidence investigations. International Journal of Digital Evidence, 4(1), 1–13.
Chawla, N. V., Japkowicz, N., & Kotcz, A. (2004). Editorial: Special issue on learning from imbalanced data sets. SIGKDD Explorations, 6(1), 1–6.
Coyotl-Morales, R. M., Villaseñor-Pineda, L., Montes-Y-Gómez, M., & Rosso, P. (2006). Authorship attribution using word sequences. In J. F. Martínez-Trinidad, J. A. Carrasco-Ochoa, & J. Kittler (Eds.), CIARP (Vol. 4225, pp. 844–853). Springer, Lecture Notes in Computer Science.
Diederich, J., Kindermann, J., Leopold, E., & Paass, G. (2003). Authorship attribution with support vector machines. Applied Intelligence, 19(1/2), 109–123.
Hartley, H. O., & Rao, J. N. K. (1968). Classification and estimation in analysis of variance problems. Review of the International Statistical Institute, 36(2), 141–147.
Holmes, D. I. (1994). Authorship attribution. Computers and the Humanities, 28, 87–106.
Hoste, V. (2005). Optimization issues in machine learning of coreference resolution. Ph.D. thesis, Faculteit Letteren en Wijsbegeerte, Universiteit Antwerpen, Belgium.
Joachims, T. (1999). Transductive inference for text classification using support vector machines. In Proceedings of the 16th International Conference on Machine Learning (pp. 200–209). San Francisco, CA: Morgan Kaufmann.
Kaster, A., Siersdorfer, S., & Weikum, G. (2005). Combining text and linguistic document representations for authorship attribution. In SIGIR Workshop: Stylistic Analysis of Text for Information Access (STYLE) (pp. 27–35).
Kilgarriff, A., & Grefenstette, G. (2003). Introduction to the special issue of the Web as corpus. Computational Linguistics, 29(2), 333–347.
Malyutov, M. B. (2006). Authorship attribution of texts: A review. In R. Ahlswede, L. Bäumer, N. Cai, H. K. Aydinian, V. Blinovsky, C. Deppe, & H. Mashurian (Eds.), GTIT-C (Vol. 4123, pp. 362–380). Springer, Lecture Notes in Computer Science.
Moschitti, A., & Basili, R. (2004). Complex linguistic features for text classification: A comprehensive study. In S. McDonald & J. Tait (Eds.), Proceedings of the 26th European Conference on Information Retrieval (ECIR 2004) (Vol. 2997, pp. 181–196). Sunderland, UK: Springer, Lecture Notes in Computer Science.
Nigam, K., Mccallum, A. K., Thrun, S., & Mitchell, T. (2000). Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2/3), 103–134.
Peng, F., Schuurmans, D., Wang, S. (2004). Augmenting naive Bayes classifiers with statistical language models. Information Retrieval, 7(3–4), 317–345.
Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1), 1–47.
Seeger, M. (2000). Learning with labeled and unlabeled data. Tech. Rep. Edinburgh, UK: University of Edinburgh.
Smucker, M., Allan, J., & Carterette, B. (2007). A comparison of statistical significance tests for information retrieval evaluation. In Proceedings of the ACM Sixteenth Conference on Information and Knowledge Management (pp. 623–632).
Solorio, T. (2002). Using unlabeled data to improve classifier accuracy. Master’s thesis, Computer Science Department, INAOE, Mexico.
Stamatatos, E., Fakotakis, N., & Kokkinakis, G. (2001). Computer-based authorship attribution without lexical measures. Computers and the Humanities, 35, 193–214.
Witten, I. H., & Frank, E. (1999). Data mining: Practical machine learning tools and techniques with Java implementations. Morgan Kaufmann.
Yu, B. (2006). An evaluation of text classification methods for literary study. Ph.D. thesis, Champaign, IL, USA.
Zelikovitz, S., & Hirsh, H. (2002). Integrating background knowledge into nearest-neighbor text classification. In S. Craw & A. D. Preece (Eds.), ECCBR (Vol. 2416, pp. 1–5). Springer, Lecture Notes in Computer Science.
Zelikovitz, S., & Kogan, M. (2006). Using web searches on important words to create background sets for LSI classification. In G. Sutcliffe & R. Goebel (Eds.), FLAIRS Conference (pp. 598–603). AAAI Press.
Zhao, Y., & Zobel, J. (2005). Effective and scalable authorship attribution using function words. In G. G. Lee, A. Yamada, H. Meng, & S. H. Myaeng (Eds.), AIRS (Vol. 3689, pp. 174–189). Springer, Lecture Notes in Computer Science.
Zhu, X. (2005). Semi-supervised learning literature survey. Tech. Rep. Computer Sciences, University of Wisconsin-Madison.