Extracting the roots of Arabic words without removing affixes

Journal of Information Science - Volume 40, Issue 3 - Pages 376-385 - 2014
Qussai Yaseen1, Ismail Hmeidi2
1Yarmouk University, Jordan
2Jordan University of Science and Technology, Jordan

Abstract

Most research on Arabic root extraction focuses on removing affixes from Arabic words. This process adds processing overhead and may remove non-affix letters, which leads to the extraction of incorrect roots. This paper proposes a new approach to this issue by introducing a new algorithm for extracting the roots of Arabic words. The proposed algorithm, called the Word Substring Stemming (WSS) algorithm, does not remove affixes during the extraction process. Rather, it produces the set of all substrings of an Arabic word and uses the Arabic roots file, the Arabic patterns file and a concrete set of rules to extract the correct roots from the substrings. Experiments show that the proposed approach is competitive, with an accuracy of 83.9%. Furthermore, its accuracy can be enhanced, since for about 9.9% of the tested words the WSS algorithm retrieves two candidates (in most cases) for the correct root.
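The core idea of generating substrings and checking them against a list of known roots can be illustrated with a minimal sketch. This is not the WSS algorithm itself: the abstract does not describe the patterns file or the rule set, so both are omitted here. The function names, the toy root list, the restriction to contiguous substrings, and the length bounds of 3 to 4 letters are all assumptions made for illustration.

```python
def substrings(word, min_len=3, max_len=4):
    """Return the distinct contiguous substrings of `word` whose length
    lies in [min_len, max_len]. The length bounds are an assumption."""
    found = set()
    for start in range(len(word)):
        for length in range(min_len, max_len + 1):
            end = start + length
            if end <= len(word):
                found.add(word[start:end])
    return found


def candidate_roots(word, known_roots):
    """Return the substrings of `word` that appear in `known_roots`.
    No affix is stripped; the word is only sliced and looked up."""
    return sorted(substrings(word) & known_roots)


# Hypothetical stand-in for the roots file mentioned in the abstract.
TOY_ROOTS = {"كتب", "درس", "علم"}

print(candidate_roots("المكتبة", TOY_ROOTS))
```

For the word المكتبة ("the library"), the only substring found in the toy root list is كتب. A fuller implementation would still need the pattern matching and the selection rules that the paper uses to choose among several candidates.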
