Paraphrase type identification for plagiarism detection using contexts and word embeddings

Faisal Alvi1, Mark Stevenson2, Paul Clough3
1Information and Computer Science Department, King Fahd University of Petroleum and Minerals, Dhahran, Saudi Arabia
2Department of Computer Science, University of Sheffield, Sheffield, United Kingdom
3Information School, University of Sheffield, Sheffield, United Kingdom

Abstract

Paraphrase types have been proposed by researchers as the paraphrasing mechanisms underlying acts of plagiarism. Synonymous substitution, word reordering and insertion/deletion have been identified as some of the common paraphrasing strategies used by plagiarists. However, the similarity reports generated by most plagiarism detection systems provide only a similarity score and the matching sections of text with their possible sources, without identifying the paraphrase types involved. In this research we propose methods to identify two important paraphrase types, synonymous substitution and word reordering, in paraphrased, plagiarised sentence pairs. We propose a three-stage approach that uses context matching and pretrained word embeddings to identify both types. Our results indicate that combining the Smith-Waterman Algorithm for Plagiarism Detection with ConceptNet Numberbatch pretrained word embeddings produces the best performance in terms of $F_1$ scores. This research can complement the similarity reports generated by currently available plagiarism detection systems by adding methods that identify paraphrase types.
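To make the idea of context matching with word embeddings concrete, the following is a minimal Python sketch, not the authors' three-stage method. It assumes that a substitution candidate is a pair of different words sharing the same left and right neighbour, and that a pair counts as synonymous when the cosine similarity of its vectors reaches a threshold. The function names, the 0.8 threshold and the toy 3-dimensional vectors are all hypothetical; a real system would load pretrained vectors such as ConceptNet Numberbatch.

```python
import math

# Toy vectors standing in for pretrained embeddings (hypothetical values).
TOY_EMBEDDINGS = {
    "big": [0.9, 0.1, 0.0],
    "large": [0.85, 0.15, 0.05],
    "red": [0.0, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def find_substitutions(source, suspect, embeddings, threshold=0.8):
    """Return (source_word, suspect_word) pairs that differ, share both
    neighbouring words, and have embedding similarity >= threshold."""
    pairs = []
    for i in range(1, len(source) - 1):
        for j in range(1, len(suspect) - 1):
            same_context = (source[i - 1] == suspect[j - 1]
                            and source[i + 1] == suspect[j + 1])
            if not same_context or source[i] == suspect[j]:
                continue
            u = embeddings.get(source[i])
            v = embeddings.get(suspect[j])
            if u is not None and v is not None and cosine(u, v) >= threshold:
                pairs.append((source[i], suspect[j]))
    return pairs


def find_reordered_pairs(source, suspect):
    """Return pairs of shared words whose relative order is inverted.
    Only words occurring exactly once in each sentence are considered."""
    shared = [w for w in source
              if source.count(w) == 1 and suspect.count(w) == 1]
    inverted = []
    for idx, a in enumerate(shared):
        for b in shared[idx + 1:]:
            if suspect.index(a) > suspect.index(b):
                inverted.append((a, b))
    return inverted


if __name__ == "__main__":
    src = "he lives in a big house".split()
    sus = "he lives in a large house".split()
    print(find_substitutions(src, sus, TOY_EMBEDDINGS))

    src2 = "john quickly opened door".split()
    sus2 = "quickly john opened door".split()
    print(find_reordered_pairs(src2, sus2))
```

The sketch compares exact neighbouring tokens for simplicity. The Smith-Waterman alignment named in the abstract would instead align the two sentences locally before looking for mismatched positions, which this example does not attempt.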
