Word clustering based on POS feature for efficient twitter sentiment analysis

Yili Wang1, KyungTae Kim2, ByungJun Lee1, Hee Yong Youn2
1College of Information and Communication Engineering, Sungkyunkwan University, Suwon, Korea
2College of Software, Sungkyunkwan University, Suwon, Korea

Abstract

With the rapid growth of social networking services on the Internet, a huge amount of information is continuously generated in real time. As a result, sentiment analysis of online reviews and messages has become a popular research issue [1]. This paper proposes a novel feature clustering and weighting scheme, based on a modified Chi-square statistic, for the sentiment analysis of Twitter messages. Along with part-of-speech (POS) tagging, the discriminability and dependency of the words in the tagged training dataset are taken into account in the clustering and weighting process. The multinomial Naïve Bayes model is also employed to handle redundant features, and the influence of emotional words is raised to maximize accuracy. Computer simulation with the Sentiment 140 workload shows that the proposed scheme significantly outperforms four existing representative sentiment analysis schemes in terms of accuracy, regardless of the size of the training and test data.
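As a rough illustration of the two building blocks named above, the following Python sketch scores terms with the standard (unmodified) Chi-square statistic and classifies with a multinomial Naïve Bayes model using add-one smoothing. It is not the paper's modified scheme: the POS tagging, clustering, and emotional-word weighting steps are omitted, and the toy tweets, the function names (`chi_square`, `train_nb`, `predict_nb`), and the "score greater than zero" selection rule are all assumptions made for this example.

```python
import math
from collections import Counter


def chi_square(docs, labels, term, cls):
    """Standard chi-square score between a term and a class (2x2 table)."""
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_cls = label == cls
        if has_term and in_cls:
            a += 1  # term present, document in class
        elif has_term:
            b += 1  # term present, document in another class
        elif in_cls:
            c += 1  # term absent, document in class
        else:
            d += 1  # term absent, document in another class
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - c * b) ** 2 / denom


def train_nb(docs, labels, vocab):
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""
    classes = sorted(set(labels))
    prior = {c: math.log(labels.count(c) / len(labels)) for c in classes}
    counts = {c: Counter() for c in classes}
    for tokens, label in zip(docs, labels):
        counts[label].update(t for t in tokens if t in vocab)
    loglik = {}
    for c in classes:
        total = sum(counts[c].values()) + len(vocab)
        loglik[c] = {t: math.log((counts[c][t] + 1) / total) for t in vocab}
    return prior, loglik


def predict_nb(model, tokens):
    """Return the class with the highest log-posterior; unknown terms are ignored."""
    prior, loglik = model
    scores = {
        c: prior[c] + sum(loglik[c][t] for t in tokens if t in loglik[c])
        for c in prior
    }
    return max(scores, key=scores.get)


# Hypothetical toy data, for illustration only.
docs = [
    ["love", "this", "phone"],
    ["great", "day"],
    ["hate", "this", "rain"],
    ["awful", "day"],
]
labels = ["pos", "pos", "neg", "neg"]

# Keep only terms that discriminate between the classes (assumed rule: score > 0).
all_terms = {t for tokens in docs for t in tokens}
vocab = {t for t in all_terms if chi_square(docs, labels, t, "pos") > 0}

model = train_nb(docs, labels, vocab)
print(sorted(vocab))
print(predict_nb(model, ["love", "this", "day"]))
```

In this toy set, terms such as "this" and "day" occur equally often in both classes, so their Chi-square score is zero and they are dropped before training. A weighting scheme like the one the abstract describes would adjust the surviving features further instead of treating them uniformly.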

References

Lizhen L et al (2014) A novel feature-based method for sentiment analysis of Chinese product reviews. China Commun 11:154–164. https://doi.org/10.1109/CC.2014.6825268
Bidi N, Elberrichi Z (2016) Feature selection for text classification using genetic algorithms. Paper presented at the 2016 8th international conference on modelling, identification and control, 806–810 Nov 2016. https://doi.org/10.1109/icmic.2016.7804223
Qiang G (2010) An effective algorithm for improving the performance of Naive Bayes for text classification. Paper presented at the second international conference on computer research and development, 699–701 May 2010. https://doi.org/10.1109/iccrd.2010.160
Sharma N et al (2016) Text classification using combined sparse representation classifiers and support vector machines. Paper presented at the 4th international symposium on computational and business intelligence, 181–185 November 2016. https://doi.org/10.1109/iscbi.2016.7743280
Joachims T (1998) Text categorization with support vector machines: learning with many relevant features. Paper presented at the European conference on machine learning, 137–142 April 1998. https://doi.org/10.1007/BFb0026683
Qiaowei J (2016) Deep feature weighting in Naive Bayes for Chinese text classification. Paper presented at the 4th international conference on cloud computing and intelligence systems, 160–164 December 2016. https://doi.org/10.1109/ccis.2016.7790245
Krouska A, Troussas C, Virvou M (2016) The effect of preprocessing techniques on twitter sentiment analysis. Paper presented at the 7th international conference on information, intelligence, systems and applications, 1–5 December 2016. https://doi.org/10.1109/iisa.2016.7785373
Suresh H (2016) An unsupervised fuzzy clustering method for twitter sentiment analysis. Paper presented at the international conference on computation system and information technology for sustainable solutions, 80–85 December 2016. https://doi.org/10.1109/csitss.2016.7779444
Yang A et al (2015) Enhanced twitter sentiment analysis by using feature selection and combination. Paper presented at the international symposium on security and privacy in social networks and big data, 52–57 Nov 2015. https://doi.org/10.1109/socialsec2015.9
Pang B, Lillian L, Vaithyanathan S (2002) Thumbs up?: sentiment classification using machine learning techniques. Paper presented at the ACL-02 conference on empirical methods in natural language processing, 10:79–86 July 2002. https://doi.org/10.3115/1118693.1118704
Zou H et al (2015) Sentiment classification using machine learning techniques with syntax features. Paper presented at the international conference on computational science and computational intelligence, 175–179 March 2015. https://doi.org/10.1109/csci.2015.44
Socher R et al (2013) Recursive deep models for semantic compositionality over a sentiment treebank. Paper presented at the conference on empirical methods in natural language processing, 1631–1642, 2013
Singh J, Singh G, Singh R (2017) Optimization of sentiment analysis using machine learning classifiers. Hum Comput Inf Sci 7:32. https://doi.org/10.1186/s13673-017-0116-3
Yu N et al (2016) A comprehensive review of emerging computational methods for gene identification. J Inf Proc Syst 12:1. https://doi.org/10.3745/JIPS.04.0023
Jiadong Z, San-Segundo R, Pardo JM (2017) Feature extraction for robust physical activity recognition. Hum Comput Inf Sci 7:16. https://doi.org/10.1186/s13673-017-0097-2
Yang Y, Pedersen JO (1997) A comparative study on feature selection in text categorization. In: ICML. 97:412–420. ISBN:1-55860-486-3
Xu Y, Chen L (2010) Term-frequency based feature selection methods for text categorization. Paper presented at the 4th international conference on genetic and evolutionary computing, 280–283 Dec 2010. https://doi.org/10.1109/icgec.2010.76
Yili W et al (2017) A novel feature-based text classification improving the accuracy of twitter sentiment analysis. Paper presented at the 12th international conference on future information technology, 440–445 May 2017. https://doi.org/10.1007/978-981-10-7605-3_72
Salton G, Buckley C (1988) Term-weighting approaches in automatic text retrieval. Inf Pro Man 24:513–523. https://doi.org/10.1016/0306-4573(88)90021-0
Zhihua X et al (2016) A secure and dynamic multi-keyword ranked search scheme over encrypted cloud data. IEEE Trans Parallel Dis Syst 27:340–352. https://doi.org/10.1109/TPDS.2015.2401003
Wen Z, Yoshida T, Xinjing T (2011) a comparative study of TF* IDF, LSI and multi-words for text classification. Expert Syst Appl 38:2758–2765. https://doi.org/10.1016/j.eswa.2010.08.066
Chuanxin J et al (2015) Chi square statistics feature selection based on term frequency and distribution for text categorization. IETE J Res 61:351–362. https://doi.org/10.1080/03772063.2015.1021385
Zhangjie F et al (2016) Enabling personalized search over encrypted outsourced data with efficiency improvement. IEEE Trans Parallel Distrib Syst 27:2546–2559. https://doi.org/10.1109/TPDS.2015.2506573
Tinghuai M et al (2016) LED: a fast overlapping communities detection algorithm based on structural clustering. Neurocomput 207:488–500. https://doi.org/10.1016/j.neucom.2016.05.020
Sentiment analysis workload Sentiment 140. http://help.sentiment140.com/home. Accessed 15 May 2016
Zebin W et al (2016) Parallel and distributed dimensionality reduction of hyperspectral data on cloud computing architectures. IEEE J Sel Top App Earth Obs Remote Sens 9:2270–2278. https://doi.org/10.1109/JSTARS.2016.2542193
Paul S, Das S (2015) simultaneous feature selection and weighting–an evolutionary multi-objective optimization approach. Pattern Recognit Lett 65:51–59. https://doi.org/10.1016/j.patrec.2015.07.007
Zhaoqing P et al (2016) Fast motion estimation based on content property for low-complexity H.265/HEVC encoder. IEEE Trans Broad 62:675–684. https://doi.org/10.1109/TBC.2016.2580920
Wikipedia Naïve Bayes classifier. https://en.wikipedia.org/wiki/Naive_Bayes_classifier. Accessed 3 June 2016
Suresh Y (2016) Software quality assessment for open source software using logistic and Naive Bayes classifier. Paper presented at the International conference on computation system and information technology for sustainable solutions, 267–272 Oct 2016. https://doi.org/10.1109/csitss.2016.7779369
Singh M, Provan, GM (1996) A comparison of induction algorithms for selective and non-selective Bayesian classifiers. Paper presented at the international conference on machine learning, 497–505 May 1996. https://doi.org/10.1016/b978-1-55860-377-6.50068-2
Wikipedia Sentiment analysis. https://en.wikipedia.org/wiki/Sentiment_analysis. Accessed 20 Jan 2016
Wilson T, Wiebe J, Hoffmann P (2005) Recognizing contextual polarity in phrase-level sentiment analysis. Paper presented at the conference on human language technology and empirical methods in natural language processing, 347–354 Oct 2015. https://doi.org/10.3115/1220575.1220619
Miller G et al (1990) Introduction to wordnet: an on-line lexical database. Int J Lexicogr 3:235–244. https://doi.org/10.1093/ijl/3.4.235
Baccianella S, Esuli A, Sebastiani F (2010) Sentiwordnet 3.0: an enhanced lexical resource for sentiment analysis and opinion mining. In: LREC, 10:2200–2204 May 2010
Troussas C et al (2013) Sentiment analysis of Facebook statuses using Naive Bayes classifier for language learning. Paper presented at the 4th international conference on information, intelligence, systems and applications, 1–6 July 2013. https://doi.org/10.1109/iisa.2013.6623713
Krouska A, Troussas C, Virvou M (2016) The effect of preprocessing techniques on twitter sentiment analysis. Paper presented at the 7th international conference on information, intelligence, systems and applications, 1–5 July 2016. https://doi.org/10.1109/iisa.2016.7785373
Troussas C, Krouska A, Virvou M (2016) Evaluation of ensemble-based sentiment classifiers for twitter data. Paper presented at the 7th international conference on information, intelligence, systems and applications, 1–6 July 2016. https://doi.org/10.1109/iisa.2016.7785380
Krouska A, Troussas C, Virvou M (2017) Comparative evaluation of algorithms for sentiment analysis over social networking services. J Univers Comput Sci 23(8):755–768. https://doi.org/10.3217/jucs-023-08-0755
Ravichandran M, Kulanthaivel G (2014) Twitter sentiment mining (TSM) framework based learners emotional state classification and visualization for e-learning system. J Theor Appl Inf Technol 69(1):84–90
Yu Y, Xiao W (2015) World Cup 2014 in the twitter World: a big data analysis of sentiments in US sports fans. Comput Hum Behav 48:392–400. https://doi.org/10.1016/j.chb.2015.01.075
Smailović J et al (2014) Stream-based active learning for sentiment analysis in the financial domain. Inf Sci 285:181–203. https://doi.org/10.1016/j.ins.2014.04.034
Silva Da et al (2014) Tweet sentiment analysis with classifier ensembles. Decis Support Syst 66:170–179. https://doi.org/10.1016/j.dss.2014.07.003
Yuhui Z et al (2017) Student’s t-hidden Markov model for unsupervised learning using localized feature selection. IEEE Trans Circuits Syst Video Technol. https://doi.org/10.1109/TCSVT.2017.2724940
Bahassine S, Madani A, Kissi M (2016) An improved Chi-sqaure feature selection for Arabic text classification using decision tree. Paper presented at the international conference on intelligent systems: theories and applications, 1–5 Oct 2016. https://doi.org/10.1109/sita.2016.7772289
Stanford Log-linear Part of Speech Tagger. http://nlp.stanford.edu/software/tagger.shtml. Accessed 13 Mar 2017
Slide share text analysis for security. https://www.slideshare.net/taoxiease/text-analytics-for-security. Accessed 16 Apr 2017
Mekuria Z, Assabie Y (2014) A hybrid approach to the development of part-of-speech tagger for Kafi-noonoo text. Paper presented at the international conference on intelligent text processing and computational linguistics, 214–224 April 2014. https://doi.org/10.1007/978-3-642-54906-9_17
O’Keefe T, Koprinska I (2009) feature selection and weighting methods in sentiment analysis. Paper presented at the 14th Australasian document computing symposium, 67–74 Dec 2009
Sebastiani F (2002) machine learning in automated text categorization. Paper presented at the ACM computing surveys, 34:1–47 March 2002. https://doi.org/10.1145/505282.505283
Zhaoqing P et al (2016) Fast reference frame selection based on content similarity for low complexity HEVC encoder. J Vis Commun Image Rep 40:516–524. https://doi.org/10.1016/j.jvcir.2016.07.018
Go A, Bhayani R, Huang L (2009) twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, 12, Dec 2009
Jinwei W et al (2017) Forensics feature analysis in quaternion wavelet domain for distinguishing photographic images and computer graphics. Multimedia Tools Appl 76:23721–23737. https://doi.org/10.1007/s11042-016-4153-0
Jin W et al (2015) Bio-inspired ant colony optimization based clustering algorithm with mobile sinks for applications in consumer home automation networks. IEEE Trans Consumer Electron 61:438–444. https://doi.org/10.1109/TCE.2015.7389797
Jin W et al (2005) A load-balancing and energy-aware clustering algorithm in wireless ad-hoc networks. Paper presented at the international conference on embedded and ubiquitous computing, 1108–1117 Dec 2005. https://doi.org/10.1007/11596042_113
Jin W et al (2017) Energy-efficient cluster-based dynamic routes adjustment approach for wireless sensor networks with mobile sinks. J Supercomput 73:3277–3290. https://doi.org/10.1007/s11227-016-1947-9
Zhangjie F et al (2015) Privacy-preserving smart similarity search based on Simhash over encrypted data in cloud computing. J Int Technol 16:453–460. https://doi.org/10.6138/JIT.2015.16.3.20140918
Huan R et al (2018) A novel subgraph K+-isomorphism method in social network based on graph similarity detection. Soft Comput 22:2583–2601. https://doi.org/10.1007/s00500-017-2513-y
Gu B, Sun X, Sheng VS (2017) Structural minimax probability machine. IEEE Trans Neural Netw Learn Syst 28:1646–1656. https://doi.org/10.1109/TNNLS.2016.2544779
Wikipedia. Chi squared distribution. https://en.wikipedia.org/wiki/Chisquared_distribution. Accessed 2 July 2017
Gu B, Sheng VS (2017) A robust regularization path algorithm for ν-support vector classification. IEEE Trans Neural Netw Learn Syst 28:1241–1248. https://doi.org/10.1109/TNNLS.2016.2527796
The Stanford Natural language processing group. Naive Bayes text classification. http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html. Accessed 25 June 2016
GitHub. Matlab-standford-postagger. https://github.com/musically-ut/matlab-stanford-postagger. Accessed 15 Mar 2017