Systematic Homonym Detection and Replacement Based on Contextual Word Embedding
Abstract
Homonyms are words that share a spelling but differ in meaning, and they are common in most languages. They are a source of noise in most text analyses and are difficult to detect, so numerous studies have addressed the problem. However, existing methods typically detect homonyms with rule-based or statistical approaches, which require an answer set and pay little attention to the semantic meaning of the word. We therefore propose a novel approach to homonym detection based on contextual word embedding, which allows a word to be understood from the context in which it appears. In this study, we extracted all contextual word embedding vectors of individual words and clustered those vectors with spherical k-means clustering to detect pairs of homonyms. In addition, we developed a homonym replacement method to improve the performance of a document embedding technique that relies on word vector values. Using the proposed homonym detection method, we replaced the embedding vectors of homonyms with a representative vector for each respective meaning. Experimental results indicate that the proposed method effectively detects homonyms and significantly improves the performance of document embedding.
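The detection and replacement steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy three-dimensional vectors stand in for contextual embeddings of one word, the cluster count k = 2 is fixed by hand, and the rule in `looks_like_homonym` (flag the word when the two cluster centroids have cosine similarity below a threshold of 0.5) is a hypothetical criterion chosen only for illustration. All function names are assumptions as well.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def spherical_kmeans(vectors, k=2, n_iter=50, seed=0):
    """Cluster vectors by cosine similarity; returns (labels, unit centroids)."""
    rng = random.Random(seed)
    points = [normalize(v) for v in vectors]
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(n_iter):
        # Assign each point to the centroid with the highest cosine similarity.
        new_labels = [max(range(k), key=lambda j: dot(p, centroids[j]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        # Recompute each centroid as the normalized sum of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = normalize([sum(col) for col in zip(*members)])
    return labels, centroids


def looks_like_homonym(centroids, threshold=0.5):
    """Hypothetical rule: two well-separated sense clusters suggest a homonym."""
    return dot(centroids[0], centroids[1]) < threshold


def replace_with_representatives(labels, centroids):
    """Replace each occurrence vector with the centroid of its sense cluster."""
    return [centroids[label] for label in labels]


# Toy stand-ins for contextual embeddings of one word seen in six sentences:
# the first three occurrences share one sense, the last three share another.
occurrences = [
    [1.0, 0.1, 0.0], [0.9, 0.2, 0.1], [1.0, 0.0, 0.1],
    [0.0, 0.1, 1.0], [0.1, 0.2, 0.9], [0.1, 0.0, 1.0],
]
labels, centroids = spherical_kmeans(occurrences, k=2)
if looks_like_homonym(centroids):
    replaced = replace_with_representatives(labels, centroids)
```

In practice the occurrence vectors would come from a contextual embedding model, and the number of clusters and the decision rule would need to be chosen and validated on real data rather than fixed as they are here.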