A comparison of text‐classification techniques applied to Arabic text
Abstract
Many algorithms have been implemented for text classification, but most of this work targets English text; very little research has been carried out on Arabic text. Arabic text differs in nature from English text, and preprocessing it is more challenging. This paper presents an implementation of three automatic text‐classification techniques for Arabic text. A corpus of 1445 Arabic text documents belonging to nine categories was automatically classified using the kNN, Rocchio, and naïve Bayes algorithms. The results show that naïve Bayes was the best performer, followed by kNN and Rocchio.
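To make the three techniques named above concrete, the following is a minimal sketch of each classifier, not the paper's implementation. It assumes documents are already tokenized, uses raw term counts with cosine similarity, simplifies Rocchio to a positive-example centroid classifier, and uses a multinomial naïve Bayes with add-one smoothing. The toy data, the English tokens standing in for preprocessed Arabic terms, and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training set of (tokens, label) pairs; not the paper's corpus.
TRAIN = [
    ("match goal team player".split(), "sports"),
    ("team coach goal league".split(), "sports"),
    ("bank market price trade".split(), "economy"),
    ("market bank loan price".split(), "economy"),
]


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_predict(tokens, train, k=3):
    """kNN: majority label among the k most similar training documents."""
    query = Counter(tokens)
    ranked = sorted(train, key=lambda d: cosine(query, Counter(d[0])), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def rocchio_predict(tokens, train):
    """Simplified Rocchio: pick the class whose centroid is closest to the query.

    The centroid here is the sum of the class's term counts; the
    negative-example term of the full Rocchio formula is omitted.
    """
    centroids = defaultdict(Counter)
    for doc, label in train:
        centroids[label].update(doc)
    query = Counter(tokens)
    return max(centroids, key=lambda c: cosine(query, centroids[c]))


def nb_predict(tokens, train):
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""
    vocab = {t for doc, _ in train for t in doc}
    term_counts = defaultdict(Counter)
    doc_counts = Counter()
    for doc, label in train:
        term_counts[label].update(doc)
        doc_counts[label] += 1

    def log_posterior(c):
        total = sum(term_counts[c].values())
        score = math.log(doc_counts[c] / len(train))  # log prior
        for t in tokens:
            if t in vocab:  # ignore terms never seen in training
                score += math.log((term_counts[c][t] + 1) / (total + len(vocab)))
        return score

    return max(doc_counts, key=log_posterior)


if __name__ == "__main__":
    query = "goal team league".split()
    for name, predict in [("kNN", knn_predict), ("Rocchio", rocchio_predict),
                          ("naive Bayes", nb_predict)]:
        print(name, predict(query, TRAIN))
```

On a real Arabic corpus the preprocessing step (tokenization, stop-word removal, stemming) and the term-weighting scheme would matter a great deal; this sketch leaves both out.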