Systematic Homonym Detection and Replacement Based on Contextual Word Embedding

Springer Science and Business Media LLC - Volume 53 - Pages 17-36 - 2020
Younghoon Lee¹
¹Department of Industrial Engineering, Seoul National University of Science and Technology, Nowon-gu, Seoul, Republic of Korea

Abstract

Homonyms are words that share a spelling but differ in meaning, and they are common in most languages. They are a source of noise in most text analyses and are difficult to detect, and numerous studies have addressed the problem. However, existing methods typically detect homonyms with a rule-based or statistical approach, which requires an answer set and pays little regard to the semantic meaning of the word. We therefore propose a novel approach to homonym detection based on contextual word embedding, which allows a word to be understood from the context in which it appears. In this study, we extracted all contextual word embedding vectors of individual words and clustered those vectors using spherical k-means clustering to detect pairs of homonyms. In addition, we developed a homonym replacement method to improve the performance of a document embedding technique, based on the word vector value. Using the proposed homonym detection method, we replaced the embedding vectors of homonyms with a representative vector for each respective meaning. Experimental results indicate that the proposed method effectively detects homonyms and significantly improves the performance of document embedding.
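
The pipeline summarized in the abstract (cluster a word's contextual vectors with spherical k-means, then replace each occurrence with a representative vector for its sense) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not a detail taken from the paper: synthetic vectors stand in for real contextual embeddings, the function names are hypothetical, the number of clusters is fixed at two, the centroid-similarity threshold of 0.5 is an arbitrary decision rule, and farthest-first initialisation is used only to keep the example deterministic.

```python
import numpy as np


def spherical_kmeans(X, k, n_iter=20):
    """Cluster the rows of X by cosine similarity.

    Returns (labels, centroids); centroids are unit-length rows.
    """
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    # Farthest-first initialisation (an illustrative choice): start from row 0,
    # then repeatedly add the row least similar to the centroids chosen so far.
    chosen = [0]
    while len(chosen) < k:
        best_sim = (X @ X[chosen].T).max(axis=1)
        chosen.append(int(np.argmin(best_sim)))
    centroids = X[chosen].copy()
    for _ in range(n_iter):
        # Assign each vector to the centroid with the highest cosine similarity.
        labels = np.argmax(X @ centroids.T, axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                # The spherical centroid is the normalised sum of its members.
                total = members.sum(axis=0)
                centroids[j] = total / np.linalg.norm(total)
    labels = np.argmax(X @ centroids.T, axis=1)
    return labels, centroids


def detect_homonym(vectors, threshold=0.5):
    """Hypothetical rule: flag a word when its two sense centroids point in
    clearly different directions (cosine similarity below `threshold`)."""
    labels, centroids = spherical_kmeans(vectors, k=2)
    similarity = float(centroids[0] @ centroids[1])
    return similarity < threshold, labels, centroids


def replace_with_representatives(labels, centroids):
    """Replace every occurrence vector with the centroid of its cluster."""
    return centroids[labels]


# Synthetic stand-ins for contextual embeddings (not real model output).
rng = np.random.default_rng(0)
dim = 8
sense_a, sense_b = np.eye(dim)[0], np.eye(dim)[1]
two_sense_word = np.vstack([
    sense_a + 0.1 * rng.normal(size=(20, dim)),  # occurrences in sense A
    sense_b + 0.1 * rng.normal(size=(20, dim)),  # occurrences in sense B
])
one_sense_word = sense_a + 0.1 * rng.normal(size=(40, dim))

is_homonym, labels, centroids = detect_homonym(two_sense_word)
replaced = replace_with_representatives(labels, centroids)
single_flag, _, _ = detect_homonym(one_sense_word)
print("two-sense word flagged:", is_homonym)
print("one-sense word flagged:", single_flag)
```

In a real setting the synthetic arrays would be replaced by the per-occurrence vectors produced by a contextual embedding model, and both the number of clusters and the decision rule would need to be chosen and validated against labeled data.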
