A comparison of text‐classification techniques applied to Arabic text

Wiley - Volume 60, Issue 9 - Pages 1836-1844 - 2009
Ghassan Kanaan1, Riyad Al‐Shalabi1, Sameh Ghwanmeh2, Hamda Al‐Ma'adeed1
1Arab Academy for Banking and Financial Services, Amman, Jordan
2Computer Engineering Department, Yarmouk University, Jordan

Abstract

Many algorithms have been implemented for the problem of text classification. Most of the work in this area has been carried out on English text; very little research has addressed Arabic text. The nature of Arabic text differs from that of English text, and preprocessing Arabic text is more challenging. This paper presents an implementation of three automatic text‐classification techniques for Arabic text. A corpus of 1445 Arabic text documents belonging to nine categories was automatically classified using the kNN, Rocchio, and naïve Bayes algorithms. The results show that naïve Bayes was the best performer, followed by kNN and Rocchio.
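For readers unfamiliar with the three techniques named in the abstract, the following is a minimal sketch of their textbook formulations, not the authors' implementation. Everything in it is an assumption made for illustration: the toy English training data, the function names, raw term-frequency weighting, cosine similarity, a centroid-only Rocchio variant (no negative-example term), and add-one smoothing for naïve Bayes. The Arabic preprocessing the abstract describes as challenging is omitted.

```python
import math
from collections import Counter, defaultdict

# Toy, made-up training data (hypothetical; NOT the paper's 1445-document corpus).
TRAIN = [
    ("sports", "match goal team goal"),
    ("sports", "team player match"),
    ("economy", "bank market price"),
    ("economy", "market bank loan price"),
]

def tf(text):
    """Raw term-frequency vector as a Counter (whitespace tokenization)."""
    return Counter(text.split())

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn(text, k=3):
    """Majority vote among the k training documents most similar to text."""
    q = tf(text)
    ranked = sorted(TRAIN, key=lambda d: cosine(q, tf(d[1])), reverse=True)
    votes = Counter(label for label, _ in ranked[:k])
    return votes.most_common(1)[0][0]

def rocchio(text):
    """Pick the class whose centroid (summed term vector) is closest by cosine.

    Simplified centroid-only variant: no negative-example term is subtracted.
    """
    centroids = defaultdict(Counter)
    for label, doc in TRAIN:
        centroids[label].update(tf(doc))
    q = tf(text)
    return max(centroids, key=lambda c: cosine(q, centroids[c]))

def naive_bayes(text):
    """Multinomial naive Bayes with add-one smoothing, scored in log space."""
    term_counts = defaultdict(Counter)
    doc_counts = Counter()
    for label, doc in TRAIN:
        term_counts[label].update(tf(doc))
        doc_counts[label] += 1
    vocab = {t for counts in term_counts.values() for t in counts}

    def score(label):
        total = sum(term_counts[label].values())
        s = math.log(doc_counts[label] / len(TRAIN))  # log prior
        for t in text.split():
            s += math.log((term_counts[label][t] + 1) / (total + len(vocab)))
        return s

    return max(doc_counts, key=score)
```

On this toy data, all three functions assign `"goal match team"` to `sports` and `"bank loan price"` to `economy`. The toy set is far too small to reproduce the ranking reported in the abstract.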
