Celebrity profiling through linguistic analysis of digital social networks

Springer Science and Business Media LLC - Volume 8 - Pages 1-36 - 2021
Luis G. Moreno-Sandoval1,2, Alexandra Pomares-Quimbaya1,2, Jorge A. Alvarado-Valencia1,3
1Center of Excellence and Appropriation in Big Data Analytics (CAOBA), Bogota, Colombia
2Department of System Engineering, Pontificia Universidad Javeriana, Bogota, Colombia
3Department of Industrial Engineering, Pontificia Universidad Javeriana, Bogota, Colombia

Abstract

Digital social networks have become an essential source of information because celebrities use them to share their opinions, ideas, thoughts, and feelings. This makes them one of the preferred channels for celebrities to promote themselves and attract new followers. This paper proposes a feature selection model for classifying celebrity profiles based on their use of the digital social network Twitter. The model analyzes lexical, syntactic, symbolic, participation, and complementary-information features of celebrities' posts and uses them to estimate the celebrities' demographic and influence characteristics. Classification with these new features achieves an F1 score of 0.65 for fame, 0.88 for gender, 0.37 for birth year, and 0.57 for occupation. With these new features, average accuracy improves by up to 0.14. As a result, the features extracted from linguistic cues improve the performance of the predictive models for fame and gender and make the model results easier to explain. In particular, use of the third person singular is highly predictive in the fame model.
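
As an illustration of the kind of linguistic cue the abstract describes, the following minimal sketch computes a few per-author rates from a list of tweets, including the share of third person singular pronouns. It is not the authors' feature set or implementation: the pronoun list, the regular expressions, the feature names, and the function `extract_features` are all assumptions made for this example, and it handles English text only.

```python
import re
from typing import Dict, List

# Assumption: a small English pronoun list, chosen only for illustration.
THIRD_PERSON_SINGULAR = {
    "he", "she", "it", "him", "her", "his", "hers", "its",
    "himself", "herself", "itself",
}

WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")
HASHTAG_RE = re.compile(r"#\w+")
MENTION_RE = re.compile(r"@\w+")


def extract_features(tweets: List[str]) -> Dict[str, float]:
    """Return a few illustrative lexical and symbolic rates for one author."""
    text = " ".join(tweets)
    # Strip hashtags and mentions so they are not counted as ordinary words.
    plain = MENTION_RE.sub(" ", HASHTAG_RE.sub(" ", text))
    tokens = [t.lower() for t in WORD_RE.findall(plain)]
    # Guard against division by zero for authors with no text.
    n_tokens = max(len(tokens), 1)
    n_tweets = max(len(tweets), 1)
    third = sum(1 for t in tokens if t in THIRD_PERSON_SINGULAR)
    return {
        "third_person_singular_rate": third / n_tokens,
        "type_token_ratio": len(set(tokens)) / n_tokens,
        "hashtags_per_tweet": len(HASHTAG_RE.findall(text)) / n_tweets,
        "mentions_per_tweet": len(MENTION_RE.findall(text)) / n_tweets,
    }


# Hypothetical sample posts, invented for this example.
sample = [
    "She said it was her best show yet! #tour",
    "Thanks @fans for coming #tour #live",
]
features = extract_features(sample)
print(features)
```

Rates such as these could be passed as columns to any standard classifier; normalizing by token or tweet count keeps authors with very different posting volumes comparable.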

Keywords

#celebrities #linguistic analysis #digital social networks #profile classification #Twitter
