Creation of a Russian Stop Word List
Abstract
This article describes three identifying characteristics of stop words (statistical, semantic, and morphological) and proposes new principles for creating stop word lists based on these characteristics. The principles are applied to build a general Russian-language stop word list from an analysis of existing sources and of frequency distributions in the Russian National Corpus. The resulting list contains 535 stop words.
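As a rough illustration of the statistical characteristic only, the sketch below ranks tokens in a tiny corpus by relative frequency and flags those that already appear in an existing stop list. It is not the article's procedure: the function name `candidate_stop_words`, the toy sentences, and the five-word `existing` set are all hypothetical, and the semantic and morphological criteria mentioned in the abstract are not modeled.

```python
import re
from collections import Counter


def candidate_stop_words(texts, existing_list, top_n=5):
    """Rank tokens by relative frequency (hypothetical helper).

    Returns a list of (word, relative_frequency, already_listed) tuples
    for the top_n most frequent tokens.
    """
    tokens = []
    for text in texts:
        # Keep lowercase Cyrillic word forms only; no lemmatization is done.
        tokens.extend(re.findall(r"[а-яё]+", text.lower()))
    counts = Counter(tokens)
    total = sum(counts.values())
    return [
        (word, count / total, word in existing_list)
        for word, count in counts.most_common(top_n)
    ]


# Toy data, invented for illustration; a real study would use corpus frequencies.
toy_corpus = [
    "Я не знаю, что он и она делали в этом доме.",
    "Он сказал, что не пойдёт в школу и на работу.",
    "В городе и в деревне не было света.",
]
existing = {"и", "в", "не", "что", "на"}  # assumed pre-existing stop list

for word, freq, listed in candidate_stop_words(toy_corpus, existing):
    print(f"{word}\t{freq:.3f}\t{'listed' if listed else 'new candidate'}")
```

High-frequency words missing from the existing list (here the pronoun "он") would be candidates for manual review rather than automatic inclusion, since frequency alone does not establish that a word carries little meaning.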