Proposed Model for Context Topic Identification of English and Hindi News Article Through LDA Approach with NLP Technique

Journal of The Institution of Engineers (India): Series B, Volume 103, Pages 591-597, 2021

Anukriti Srivastav¹, Satwinder Singh¹

¹Centre for Computer Science and Technology, Central University of Punjab, Bathinda, India

Abstract

According to a survey, India has the world's second-largest newspaper market, with more than 100,000 newspaper outlets, a circulation of approximately 240 million, and 1,300 million subscribers or readers. Work on topic modeling is growing steadily: researchers have published many topic modeling papers and applied the technique in areas such as software engineering, political science, and medicine. This research uses LDA topic modeling because it has been applied successfully to topic modeling and classification, and because it estimates the probability of a text under the bag-of-words scheme, without considering word order. LDA is a common topic modeling algorithm with an excellent implementation in the Gensim Python package. The challenge, however, is how to extract good-quality topics that are simple, well separated, and meaningful. The purpose of this research is to find the main topics of news articles from the same category written in two different languages (Hindi and English) and then to classify these topics across languages using a similarity measurement. The corpus is constructed with bigrams. To achieve this goal, we first build a headline and link extractor that scrapes the top news from Google News feeds for both English and Hindi (Google News collects stories that have appeared on different news websites over the last 30 days and is accessible in 35 languages), and then analyze which two news headlines are similar.
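The bigram corpus and the headline similarity step can be illustrated with a minimal sketch. The abstract does not say which similarity measure is used, so cosine similarity over bag-of-words counts is an assumption here. The function names and the three headlines are hypothetical and invented for illustration. The sketch uses only the Python standard library and leaves out the LDA step, the scraper, and any Hindi-specific preprocessing.

```python
import math
from collections import Counter


def bigram_tokens(text):
    """Lowercase, split on whitespace, and return unigrams plus adjacent-word bigrams."""
    words = text.lower().split()
    bigrams = [f"{a}_{b}" for a, b in zip(words, words[1:])]
    return words + bigrams


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two bag-of-words count vectors (word order is ignored)."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[t] * counts_b[t] for t in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty headline is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical headlines, invented for illustration only.
headline_1 = "government announces new education policy"
headline_2 = "new education policy announced by government"
headline_3 = "cricket team wins final match"

sim_related = cosine_similarity(bigram_tokens(headline_1), bigram_tokens(headline_2))
sim_unrelated = cosine_similarity(bigram_tokens(headline_1), bigram_tokens(headline_3))
print(f"related: {sim_related:.3f}, unrelated: {sim_unrelated:.3f}")
```

Adding bigrams such as `education_policy` to the token list rewards headlines that share whole phrases, not just isolated words. Comparing a Hindi headline with an English one would need an extra step, such as translation or comparison in topic space, which this sketch does not attempt.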

