Leveraging deep learning for automatic literature screening in intelligent bibliometrics

Xieling Chen1, Haoran Xie2, Zongxi Li3, Dian Zhang4, Gary Cheng5, Fu Lee Wang3, Hong-Ning Dai6, Qing Li7
1School of Education, Guangzhou University, Guangzhou, China
2Department of Computing and Decision Sciences, Lingnan University, Hong Kong SAR, China
3School of Science and Technology, Hong Kong Metropolitan University, Hong Kong SAR, China
4College of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China
5Department of Mathematics and Information Technology, The Education University of Hong Kong, Hong Kong SAR, China
6Department of Computer Science, Hong Kong Baptist University, Hong Kong SAR, China
7Department of Computing, The Hong Kong Polytechnic University, Hong Kong SAR, China

Abstract

Intelligent bibliometrics provides rich statistical information from large-scale analysis of literature data. It is promising for understanding innovation pathways, delivering meaningful insights with the support of expert knowledge, and indicating key areas of scientific inquiry. However, the exponential growth of global scientific publication output in most areas of modern science makes analyzing large volumes of literature extremely difficult and labor-intensive. This study aims to accelerate literature analysis driven by intelligent bibliometrics by leveraging deep learning for automatic literature screening. A comparison of different machine learning algorithms for automatically classifying literature by its relevance to a given research topic shows that deep learning performs best. The study also compares different features as model inputs and offers suggestions on training dataset size. By leveraging the capabilities of deep learning in predictive and big data analytics, this study contributes to intelligent bibliometrics by facilitating literature screening, and it is promising for tracking technological change and scientific evolutionary pathways.

Keywords

#Intelligent bibliometrics #deep learning #automatic literature analysis #machine learning algorithms #big data
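
The screening task described in the abstract, classifying each publication as relevant or irrelevant to a given research topic, can be illustrated with a very small baseline. The sketch below is not the authors' method: it is a multinomial naive Bayes classifier with add-one smoothing, written with the Python standard library only, of the kind a deep learning model would typically be compared against. The class name `NaiveBayesScreener`, the tokenizer, and the six training titles are all hypothetical and exist only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesScreener:
    """Multinomial naive Bayes with add-one smoothing (a baseline, not a deep model)."""

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        # Per-class word frequencies collected from the training texts.
        self.word_counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set()
        for counts in self.word_counts.values():
            self.vocab.update(counts)
        return self

    def log_scores(self, text):
        """Return the unnormalized log posterior of each class for one text."""
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for c in self.classes:
            score = math.log(self.doc_counts[c] / total_docs)
            denom = sum(self.word_counts[c].values()) + len(self.vocab)
            for tok in tokenize(text):
                if tok in self.vocab:  # words never seen in training are ignored
                    score += math.log((self.word_counts[c][tok] + 1) / denom)
            scores[c] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Hypothetical labeled titles for a topic such as "machine learning in education".
train_texts = [
    "deep learning for adaptive learning systems in education",
    "neural networks predict student performance in online courses",
    "text mining of classroom dialogue with machine learning",
    "deep sea fish population survey in the pacific ocean",
    "soil nutrients and crop yield in dry regions",
    "bridge load testing with steel sensors",
]
train_labels = ["relevant"] * 3 + ["irrelevant"] * 3

screener = NaiveBayesScreener().fit(train_texts, train_labels)
print(screener.predict("machine learning for student performance prediction"))  # relevant
print(screener.predict("crop yield survey in dry soil"))  # irrelevant
```

In a realistic setting the inputs would be titles, abstracts, or keywords exported from a bibliographic database, and the training set would need to be far larger. The abstract notes that both the choice of input features and the size of the training set are compared in the study.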
