Computational approaches to detect experts in distributed online communities: a case study on Reddit
Abstract
The key to the success of Question & Answer (Q&A) platforms is users who provide high-quality answers to challenging questions posted across various topics of interest. For more than a decade, the expert finding problem has attracted much attention in information retrieval research. Motivated by gaps in expert identification across several Q&A portals, we examine the feasibility of identifying data science experts on Reddit. Our method builds on manual coding in which two data science experts labelled expert and non-expert comments as well as out-of-scope comments; labelling the out-of-scope class is a novel contribution to the literature that enables the identification of more groups of comments across web portals. We present a semi-supervised approach that combines 1113 labelled comments with 100,226 unlabelled comments during training. We show that it is possible to develop models that identify expert, non-expert and out-of-scope comments, reaching a peak AUC of 0.93, accuracy of 0.83, MAE of 0.15 degrees and R² of 0.69. The proposed model uses the activity behaviour of every user, captured through Natural Language Processing (NLP), crowdsourced and user feature sets. We conclude that the NLP and user feature sets contribute the most to the identification of these three classes, which suggests that the method can generalise well within the domain. Finally, we make a novel contribution by presenting different types of users on Reddit, which opens many directions for future research.
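The abstract does not say which semi-supervised algorithm combines the labelled and unlabelled comments. As an illustration only, the sketch below uses self-training, one common option, with a nearest-centroid classifier written in plain Python. The two-dimensional feature vectors, the `min_margin` threshold and all function names are hypothetical and merely stand in for the NLP, crowdsourced and user feature sets mentioned above.

```python
from math import dist

def centroids(points, labels):
    """Return the mean feature vector of each class."""
    sums, counts = {}, {}
    for p, y in zip(points, labels):
        acc = sums.setdefault(y, [0.0] * len(p))
        for i, v in enumerate(p):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def predict(cents, p):
    """Return (label, margin): the nearest centroid's class and how much
    closer it is than the runner-up (a crude confidence score)."""
    ranked = sorted((dist(p, c), y) for y, c in cents.items())
    margin = ranked[1][0] - ranked[0][0] if len(ranked) > 1 else float("inf")
    return ranked[0][1], margin

def self_train(X_lab, y_lab, X_unlab, min_margin=1.0, max_rounds=10):
    """Self-training: repeatedly pseudo-label confident unlabelled points
    and refit the centroids on the enlarged training set."""
    X, y = list(X_lab), list(y_lab)
    pool = list(X_unlab)
    for _ in range(max_rounds):
        cents = centroids(X, y)
        still_unlabelled, added = [], 0
        for p in pool:
            label, margin = predict(cents, p)
            if margin >= min_margin:      # confident: adopt the pseudo-label
                X.append(p)
                y.append(label)
                added += 1
            else:                         # ambiguous: keep for a later round
                still_unlabelled.append(p)
        pool = still_unlabelled
        if added == 0:                    # nothing new learned, so stop
            break
    return centroids(X, y), pool

# Toy data (assumed): six labelled and four unlabelled "comments".
X_lab = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.8), (0.0, 5.0), (0.2, 5.1)]
y_lab = ["expert", "expert", "non-expert", "non-expert",
         "out-of-scope", "out-of-scope"]
X_unlab = [(0.5, 0.4), (4.6, 5.2), (0.3, 4.5), (2.5, 3.3)]

model, leftover = self_train(X_lab, y_lab, X_unlab)
print(predict(model, (0.4, 0.3))[0])
print(leftover)
```

In this toy run, three of the four unlabelled points lie close to a single class centroid and are absorbed into the training set, while the ambiguous point stays unlabelled. A real pipeline would replace the centroid classifier with a stronger model and calibrate the confidence threshold on held-out labelled comments.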