Summarizing large text collection using topic modeling and clustering based on MapReduce framework

Journal of Big Data - Volume 2 - Pages 1-18 - 2015
N K Nagwani
Department of Computer Science and Engineering, National Institute of Technology, Raipur, Raipur, India

Abstract

Document summarization supports faster understanding of a collection of text documents and has many real-life applications. Semantic similarity and clustering can be used to generate effective summaries of large text collections. Summarizing a large volume of text is challenging and time-consuming, particularly when semantic similarity must be computed during summarization, because the process involves intensive text processing and computation. MapReduce is a proven, state-of-the-art technology for handling Big Data. This paper proposes a novel MapReduce-based framework for summarizing large text collections. The proposed technique combines semantic-similarity-based clustering with topic modeling using Latent Dirichlet Allocation (LDA). The summarization task is performed in four stages, which gives a modular implementation of multi-document summarization. The technique is evaluated for scalability, and several text summarization measures are also reported: compression ratio, retention ratio, ROUGE, and Pyramid score. The experiments show the advantages of the MapReduce framework: it provides a faster implementation of summarization for large text collections and is a powerful tool for Big Text Data analysis.
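Two of the evaluation measures named in the abstract can be illustrated with a minimal Python sketch. It assumes the common definitions (compression ratio as summary length over source length in tokens, and ROUGE-1 recall as clipped unigram overlap with a reference summary). The paper's exact formulation may differ. The whitespace tokenizer and all function names are hypothetical and chosen only for illustration.

```python
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer; a stand-in for real preprocessing (assumption)."""
    return text.lower().split()


def compression_ratio(summary, source):
    """Summary length divided by source length, both measured in tokens."""
    source_len = len(tokenize(source))
    if source_len == 0:
        raise ValueError("source text is empty")
    return len(tokenize(summary)) / source_len


def rouge1_recall(candidate, reference):
    """Share of reference unigrams that also occur in the candidate (clipped counts)."""
    ref_counts = Counter(tokenize(reference))
    cand_counts = Counter(tokenize(candidate))
    total = sum(ref_counts.values())
    if total == 0:
        raise ValueError("reference summary is empty")
    # Clip each match at the number of times the token appears in the candidate.
    overlap = sum(min(count, cand_counts[tok]) for tok, count in ref_counts.items())
    return overlap / total
```

A lower compression ratio means a shorter summary relative to the source. Retention ratio and Pyramid score are not sketched here because both depend on a measure of information content that the abstract does not specify.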
