MarianCG: a code generation transformer model inspired by machine translation

Ahmed Soliman1, Mayada Hadhoud1, Samir I. Shaheen1
1Department of Computer Engineering, Cairo University, Giza, Egypt

Abstract

The idea that computers can build their own programs is extremely significant, and many researchers are working on this challenge. Code generation is the process of producing executable code that can be run directly on a computer and fulfills requirements expressed in natural language. It is an intriguing topic that can help developers learn a new software technology or programming language, and it can serve as a simple way to assist coding from a developer's natural language description. In this paper, we present MarianCG, a code generation Transformer model that tackles the challenge of generating Python code from natural language descriptions. Marian neural machine translation (NMT), the core model of the Microsoft Translator, is the basis of our NL-to-Code translation engine and the heart of our teaching model. MarianMT, one of the most successful machine translation transformers, serves as the teacher language model in our study. In our approach, we use sinusoidal positional embeddings to represent the position of each token in the text, and we apply no layer normalization to the embeddings. Our code generation approach, MarianCG, is based on fine-tuning a machine translation pre-trained language model, which allows us to demonstrate that a pre-trained translation model can also operate as a code generation model. The proposed model outperforms recent state-of-the-art models on the code generation problem when trained on the CoNaLa and DJANGO datasets. MarianCG achieves a BLEU score of 34.43 and an exact match accuracy of 10.2% on the CoNaLa dataset, and it records a BLEU score of 90.41 and an exact match accuracy of 81.83% on the DJANGO dataset. The implementation of the MarianCG model and relevant resources are available at https://www.github.com/AhmedSSoliman/MarianCG-NL-to-Code.
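To make the use of such a fine-tuned translation model concrete, the following minimal sketch shows how an NL-to-Code checkpoint in the MarianMT architecture could be loaded and queried through the Hugging Face Transformers seq2seq API. This is not the authors' exact pipeline, and the checkpoint identifier below is a placeholder that would have to point to a released MarianCG checkpoint or a locally fine-tuned model.

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Placeholder: replace with a published MarianCG checkpoint or a local fine-tuned model directory.
CHECKPOINT = "path-or-hub-id-of-a-MarianCG-checkpoint"

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSeq2SeqLM.from_pretrained(CHECKPOINT)
model.eval()

def generate_code(nl_description: str, max_length: int = 128, num_beams: int = 4) -> str:
    """Translate a natural language description into a Python snippet using beam search."""
    inputs = tokenizer(nl_description, return_tensors="pt", truncation=True)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_length=max_length, num_beams=num_beams)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(generate_code("check if all elements in list `mylist` are the same"))

In an evaluation set-up like the one reported in the abstract, the decoded strings would then be compared against reference code with corpus-level BLEU and exact match accuracy, which requires no model-specific tooling.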

Keywords


References

Le TH, Chen H, Babar MA (2020) Deep learning for source code modeling and generation: models, applications, and challenges. ACM Comput Surv (CSUR) 53(3):1–38

Han X, Zhang Z, Ding N, Gu Y, Liu X, Huo Y, Qiu J, Yao Y, Zhang A, Zhang L, Han W, Huang M, Jin Q, Lan Y, Liu Y, Liu Z, Lu Z, Qiu X, Song R, Tang J, Wen JR, Yuan J, Zhao WX, Zhu J (2021) Pre-trained models: past, present and future. AI Open 2:225–250. https://doi.org/10.1016/j.aiopen.2021.08.002

Peters ME, Neumann M, Iyyer M, Gardner M, Clark C, Lee K, Zettlemoyer L (2018) Deep contextualized word representations. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). Association for Computational Linguistics, New Orleans, pp 2227–2237. https://doi.org/10.18653/v1/N18-1202

Devlin J, Chang MW, Lee K, Toutanova K (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805

Gu Y, Han X, Liu Z, Huang M (2022) PPT: Pre-trained prompt tuning for few-shot learning. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, vol 1: Long Papers. Association for Computational Linguistics, Dublin, p 8410–8423. https://doi.org/10.18653/v1/2022.acl-long.576

Ding N, Qin Y, Yang G, Wei F, Yang Z, Su Y, Hu S, Chen Y, Chan CM, Chen W, Yi J, Zhao W, Wang X, Liu Z, Zheng H, Chen J, Liu Y, Tang J, Li J, Sun M (2022) Delta tuning: a comprehensive study of parameter efficient methods for pre-trained language models. arXiv preprint arXiv:2203.06904

Qin Y, Zhang J, Lin Y, Liu Z, Li P, Sun M, Zhou J (2022) ELLE: Efficient lifelong pre-training for emerging data. In: Findings of the Association for Computational Linguistics: ACL 2022. Association for Computational Linguistics, Dublin, p 2789–2810. https://doi.org/10.18653/v1/2022.findings-acl.220

Phuong M, Hutter M (2022) Formal algorithms for transformers. arXiv preprint arXiv:2207.09238

Brown TB, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, Neelakantan A, Shyam P, Sastry G, Askell A, et al (2020) Language models are few-shot learners. arXiv preprint arXiv:2005.14165

Radford A, Wu J, Child R, Luan D, Amodei D, Sutskever I et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9

Shin R, Lin CH, Thomson S, Chen C, Roy S, Platanios EA, Pauls A, Klein D, Eisner J, Van Durme B (2021) Constrained language models yield few-shot semantic parsers. arXiv preprint arXiv:2104.08768

MarianMT model. https://www.huggingface.co/docs/transformers/model_doc/marian. Accessed Oct 2021

Junczys-Dowmunt M, Grundkiewicz R, Dwojak T, Hoang H, Heafield K, Neckermann T, Seide F, Germann U, Aji AF, Bogoychev N, et al (2018) Marian: fast neural machine translation in C++. arXiv preprint arXiv:1804.00344

Yin P, Deng B, Chen E, Vasilescu B, Neubig G (2018) Learning to mine aligned code and natural language pairs from Stack Overflow. In: 2018 IEEE/ACM 15th International Conference on Mining Software Repositories (MSR). IEEE

Oda Y, Fudaba H, Neubig G, Hata H, Sakti S, Toda T, Nakamura S (2015) Learning to generate pseudo-code from source code using statistical machine translation. In: 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE). pp 574–584. https://doi.org/10.1109/ASE.2015.36

Papineni K, Roukos S, Ward T, Zhu WJ (2002) BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Philadelphia, p 311–318. https://doi.org/10.3115/1073083.1073135

Dong L, Lapata M (2016) Language to logical form with neural attention. arXiv preprint arXiv:1601.01280

Yin P, Neubig G (2017) A syntactic neural model for general-purpose code generation. arXiv preprint arXiv:1704.01696

Rabinovich M, Stern M, Klein D (2017) Abstract syntax networks for code generation and semantic parsing. arXiv preprint arXiv:1704.07535

Yin P, Neubig G (2018) TranX: a transition-based neural abstract syntax parser for semantic parsing and code generation. arXiv preprint arXiv:1810.02720

Yin P, Neubig G (2019) Reranking for neural semantic parsing. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, p 4553–4559. https://doi.org/10.18653/v1/P19-1447

Shin EC, Allamanis M, Brockschmidt M, Polozov A (2019) Program synthesis and semantic parsing with learned code idioms. In: Proceedings of the 33rd International Conference on Neural Information Processing Systems (NeurIPS 2019). Advances in Neural Information Processing Systems, Vancouver, p 10825–10835. https://dl.acm.org/doi/10.5555/3454287.3455258

Sun Z, Zhu Q, Xiong Y, Sun Y, Mou L, Zhang L (2020) TreeGen: a tree-based transformer architecture for code generation. Proceedings of the AAAI Conference on Artificial Intelligence, vol 34, no 5. AAAI-20 Technical Tracks 5, Palo Alto, p 8984–8991. https://doi.org/10.1609/aaai.v34i05.6430

Xu FF, Jiang Z, Yin P, Vasilescu B, Neubig G (2020) Incorporating external knowledge through pre-training for natural language to code generation. arXiv preprint arXiv:2004.09015

Dahal S, Maharana A, Bansal M (2021) Analysis of tree-structured architectures for code generation. In: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. Association for Computational Linguistics, Bangkok, p 4382–4391. https://doi.org/10.18653/v1/2021.findings-acl.384

Norouzi S, Tang K, Cao Y (2021) Code generation from natural language with less prior knowledge and more monolingual data. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, vol 2: Short Papers. Association for Computational Linguistics, Bangkok, p 776–785. https://doi.org/10.18653/v1/2021.acl-short.98

Orlanski G, Gittens A (2021) Reading StackOverflow encourages cheating: adding question text improves extractive code generation. arXiv preprint arXiv:2106.04447

Beau N, Crabbé B (2022) The impact of lexical and grammatical processing on generating code from natural language. arXiv preprint arXiv:2202.13972

Wang Z, Cuenca G, Zhou S, Xu FF, Neubig G (2022) MCoNaLa: a benchmark for code generation from multiple natural languages. arXiv preprint arXiv:2203.08388

Kusupati U, Ailavarapu VRT (2022) Natural language to code using transformers. arXiv preprint arXiv:2202.00367

Al-Hossami E, Shaikh S (2022) A survey on artificial intelligence for source code: a dialogue systems perspective. arXiv preprint arXiv:2202.04847

Cho K, Van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, Bengio Y (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078

Lewis M, Liu Y, Goyal N, Ghazvininejad M, Mohamed A, Levy O, Stoyanov V, Zettlemoyer L (2019) BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461

Subramanyam Kalyan K, Rajasekharan A, Sangeetha S (2021) AMMUS: a survey of transformer-based pretrained models in natural language processing. arXiv e-prints arXiv–2108

Kudo T, Richardson J (2018) SentencePiece: a simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226

Kudo T (2018) Subword regularization: improving neural network translation models with multiple subword candidates. arXiv preprint arXiv:1804.10959

Sennrich R, Haddow B, Birch A (2015) Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909

Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. In: Advances in neural information processing systems 30 (NIPS 2017), Annual Conference on Neural Information Processing Systems 2017, Long Beach, CA, USA. Curran Associates, Inc., p 5998–6008. https://papers.nips.cc/paper/7181-attention-is-all-you-need

Alammar J (2018) The illustrated transformer. http://jalammar.github.io/illustrated-transformer/. Accessed May 2021

Ling W, Blunsom P, Grefenstette E, Hermann KM, Kočiský T, Wang F, Senior A (2016) Latent predictor networks for code generation. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Berlin, pp 599–609. https://doi.org/10.18653/v1/P16-1057