Hierarchization of Topical Texts Based on the Estimate of Proximity to the Semantic Pattern without Paraphrasing

Pattern Recognition and Image Analysis - Volume 30 - Pages 440-449 - 2020
D. V. Mikhaylov1, G. M. Emelyanov1
1Yaroslav the Wise Novgorod State University, Veliky Novgorod, Russia

Abstract

This paper addresses the problem of numerically estimating the mutual semantic dependence of topical texts relative to the most rational (i.e., standard) variants of describing the knowledge fragments they represent. The proximity of a text to the standard is evaluated without searching for paraphrases. The problem is relevant when determining the significance of information sources for the tasks a user performs; one example is finding the optimal order of working with primary sources when forming a student's individual educational trajectory. In the proposed solution, the proximity of a text to the standard is assessed by dividing the words of each of its phrases into classes according to their TF-IDF values relative to the texts of a corpus formed in advance by an expert. The analyzed texts are abstracts of scientific articles together with their titles. The paper considers principles for ranking and then hierarchizing the texts of an original collection, based on variants of the estimate computed relative to the title and to the phrase closest to the standard. The semantic images of the texts closest to the standard are determined by the words with the highest TF-IDF values, which, when adjacent in the linear sequence of a phrase, are most likely related in meaning and form key combinations together with words whose values are close to the average of this measure. An analysis of how words with the highest TF-IDF values occur across different texts of the collection is used to assess the relationship between their standards, which in turn serves as the basis for assessing how the texts complement each other in meaning.
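To make the central idea concrete, the following is a minimal Python sketch of splitting the words of one phrase into classes by their TF-IDF values relative to a corpus. It is an illustration under stated assumptions, not the authors' procedure: the function names `tfidf` and `classify_phrase`, the use of the maximum TF-IDF over corpus documents as a word's score, and the threshold multipliers `high` and `low` around the mean are all hypothetical choices, since the abstract does not specify how the classes are formed. The corpus is assumed to be non-empty and already tokenized.

```python
import math
from collections import Counter


def tfidf(word, doc_tokens, corpus):
    """Classic TF-IDF of `word` in one tokenized document of `corpus`."""
    if not doc_tokens:
        return 0.0
    tf = Counter(doc_tokens)[word] / len(doc_tokens)
    df = sum(1 for doc in corpus if word in doc)  # documents containing the word
    return tf * math.log(len(corpus) / df) if df else 0.0


def classify_phrase(phrase_tokens, corpus, high=1.5, low=0.5):
    """Split the distinct words of a phrase into three TF-IDF classes.

    Assumption: a word's score is its maximum TF-IDF over the corpus
    documents, and the class borders are `high * mean` and `low * mean`
    of the scores in the phrase. Both rules are illustrative stand-ins.
    """
    classes = {"high": [], "middle": [], "low": []}
    words = sorted(set(phrase_tokens))
    if not words:
        return classes
    scores = {w: max(tfidf(w, doc, corpus) for doc in corpus) for w in words}
    mean = sum(scores.values()) / len(scores)
    for w in words:
        if scores[w] > high * mean:
            classes["high"].append(w)
        elif scores[w] < low * mean:
            classes["low"].append(w)
        else:
            classes["middle"].append(w)
    return classes


# Toy corpus of three tokenized "texts" (hypothetical data).
corpus = [
    ["semantic", "pattern", "of", "text"],
    ["text", "of", "the", "corpus"],
    ["tf", "idf", "of", "text"],
]
classes = classify_phrase(["semantic", "pattern", "of", "text"], corpus)
print(classes)
```

In this toy run, words that occur in every corpus text receive an IDF of zero and fall into the low class, while words specific to a single text fall into the high class. A real pipeline would first lemmatize the words and would likely replace the fixed thresholds with a data-driven split.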
