MarianCG: a code generation transformer model inspired by machine translation
Abstract
The idea that computers can build their own programs is extremely significant, and many researchers are working on this challenge. Code generation is the task of producing executable code that can be run directly on a computer and fulfills requirements expressed in natural language. It is an intriguing problem that can help developers learn a new software technology or programming language, or simply support coding by letting developers describe what they want in natural language. In this paper, we present MarianCG, a code generation Transformer model that tackles the challenge of generating Python code from natural language descriptions. Marian neural machine translation (NMT), the core model of Microsoft Translator, is the basis of our NL-to-Code translation engine and the heart of our teaching model: MarianMT, one of the most successful machine translation transformers, serves as the teacher language model in our study. Our approach uses sinusoidal positional embeddings to represent the position of each token in the text and applies no layer normalization to the embeddings. MarianCG is obtained by fine-tuning a machine-translation pre-trained language model, which allows us to demonstrate that a pre-trained translation model can also operate as a code generation model. The proposed model outperforms recent state-of-the-art models on code generation when trained on the CoNaLa and DJANGO datasets. MarianCG achieves a BLEU score of 34.43 and an exact-match accuracy of 10.2% on the CoNaLa dataset, and a BLEU score of 90.41 and an exact-match accuracy of 81.83% on the DJANGO dataset. The implementation of the MarianCG model and relevant resources are available at
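To make the described approach concrete, the following minimal sketch (not the authors' released implementation) shows how a pre-trained MarianMT checkpoint from Hugging Face Transformers can be fine-tuned as an NL-to-Python "translator" and then used to generate code; the checkpoint name, the toy CoNaLa-style training pair, and the generation settings are illustrative assumptions.

# Minimal sketch, assuming transformers >= 4.21; not the authors' released code.
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-ROMANCE"  # assumed base checkpoint; any MarianMT model can be substituted
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# One hypothetical CoNaLa-style pair: natural-language intent -> Python snippet.
intent = "check if all elements in list `mylist` are the same"
snippet = "len(set(mylist)) == 1"

# Fine-tuning step: encode the intent as the source sequence and the code
# snippet as the target, then minimise the usual seq2seq cross-entropy loss.
batch = tokenizer(intent, text_target=snippet, return_tensors="pt",
                  padding=True, truncation=True, max_length=128)
loss = model(**batch).loss  # back-propagate this inside a normal training loop

# Inference: after fine-tuning, code is generated exactly like a translation.
inputs = tokenizer(intent, return_tensors="pt")
generated = model.generate(**inputs, max_length=128, num_beams=4)
print(tokenizer.decode(generated[0], skip_special_tokens=True))

The fixed sinusoidal position embeddings and the absence of layer normalization on the embeddings mentioned in the abstract are properties of the Marian architecture itself, so a fine-tuned checkpoint inherits them without any extra configuration.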
Keywords
References
Le TH, Chen H, Babar MA (2020) Deep learning for source code modeling and generation: models, applications, and challenges. ACM Comput Surv (CSUR) 53(3):1–38
Han X, Zhang Z, Ding N, Gu Y, Liu X, Huo Y, Qiu J, Yao Y, Zhang A, Zhang L, Han W, Huang M, Jin Q, Lan Y, Liu Y, Liu Z, Lu Z, Qiu X, Song R, Tang J, Wen JR, Yuan J, Zhao WX, Zhu J (2021) Pre-trained models: past, present and future. AI Open 2:225–250. https://doi.org/10.1016/j.aiopen.2021.08.002
Peters ME, Neumann M, Iyyer M, Gardner M, Clark C, Lee K, Zettlemoyer L (2018) Deep contextualized word representations. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). Association for Computational Linguistics, New Orleans, pp 2227–2237. https://doi.org/10.18653/v1/N18-1202
Devlin J, Chang MW, Lee K, Toutanova K (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805
Gu Y, Han X, Liu Z, Huang M (2022) PPT: Pre-trained prompt tuning for few-shot learning. In: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics, vol 1: Long Papers. Association for Computational Linguistics, Dublin, p 8410–8423. https://doi.org/10.18653/v1/2022.acl-long.576
Ding N, Qin Y, Yang G, Wei F, Yang Z, Su Y, Hu S, Chen Y, Chan CM, Chen W, Yi J, Zhao W, Wang X, Liu Z, Zheng H, Chen J, Liu Y, Tang J, Li J, Sun M (2022) Delta tuning: a comprehensive study of parameter efficient methods for pre-trained language models. arXiv preprint arXiv:2203.06904
Qin Y, Zhang J, Lin Y, Liu Z, Li P, Sun M, Zhou J (2022) ELLE: Efficient lifelong pre-training for emerging data. In: Findings of the Association for Computational Linguistics: ACL 2022. Association for Computational Linguistics, Dublin, p 2789–2810. https://doi.org/10.18653/v1/2022.findings-acl.220
Phuong M, Hutter M (2022) Formal algorithms for transformers. arXiv preprint arXiv:2207.09238
Brown TB, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, Neelakantan A, Shyam P, Sastry G, Askell A, et al (2020) Language models are few-shot learners. arXiv preprint arXiv:2005.14165
Radford A, Wu J, Child R, Luan D, Amodei D, Sutskever I et al (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8):9
Shin R, Lin CH, Thomson S, Chen C, Roy S, Platanios EA, Pauls A, Klein D, Eisner J, Van Durme B (2021) Constrained language models yield few-shot semantic parsers. arXiv preprint arXiv:2104.08768
MarianMT model. https://www.huggingface.co/docs/transformers/model_doc/marian. Accessed Oct 2021
Junczys-Dowmunt M, Grundkiewicz R, Dwojak T, Hoang H, Heafield K, Neckermann T, Seide F, Germann U, Aji AF, Bogoychev N, et al (2018) Marian: fast neural machine translation in C++. arXiv preprint arXiv:1804.00344
Yin P, Deng B, Chen E, Vasilescu B, Neubig G (2018) Learning to mine aligned code and natural language pairs from stack overflow. In: 2018 IEEE/ACM 15th International Conference on Mining Software Repositories (MSR). IEEE
Oda Y, Fudaba H, Neubig G, Hata H, Sakti S, Toda T, Nakamura S (2015) Learning to generate pseudo-code from source code using statistical machine translation. In: 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE). pp 574–584. https://doi.org/10.1109/ASE.2015.36
Papineni K, Roukos S, Ward T, Zhu WJ (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Philadelphia, p 311–318. https://doi.org/10.3115/1073083.1073135
Dong L, Lapata M (2016) Language to logical form with neural attention. arXiv preprint arXiv:1601.01280
Yin P, Neubig G (2017) A syntactic neural model for general-purpose code generation. arXiv preprint arXiv:1704.01696
Rabinovich M, Stern M, Klein D (2017) Abstract syntax networks for code generation and semantic parsing. arXiv preprint arXiv:1704.07535
Yin P, Neubig G (2018) TRANX: a transition-based neural abstract syntax parser for semantic parsing and code generation. arXiv preprint arXiv:1810.02720
Yin P, Neubig G (2019) Reranking for neural semantic parsing. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, p 4553–4559. https://doi.org/10.18653/v1/P19-1447
Shin EC, Allamanis M, Brockschmidt M, Polozov A (2019) Program synthesis and semantic parsing with learned code idioms. In: Proceedings of the 33rd International Conference on Neural Information Processing Systems (NeurIPS 2019). Advances in Neural Information Processing Systems, Vancouver, p 10825–10835. https://dl.acm.org/doi/10.5555/3454287.3455258
Sun Z, Zhu Q, Xiong Y, Sun Y, Mou L, Zhang L (2020) TreeGen: a tree-based transformer architecture for code generation. Proceedings of the AAAI Conference on Artificial Intelligence, vol 34, no 05. AAAI-20 Technical Tracks 5, Palo Alto, pp 8984–8991. https://doi.org/10.1609/aaai.v34i05.6430
Xu FF, Jiang Z, Yin P, Vasilescu B, Neubig G (2020) Incorporating external knowledge through pre-training for natural language to code generation. arXiv preprint arXiv:2004.09015
Dahal S, Maharana A, Bansal M (2021) Analysis of tree-structured architectures for code generation. In: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. Association for Computational Linguistics, Bangkok, p 4382–4391. https://doi.org/10.18653/v1/2021.findings-acl.384
Norouzi S, Tang K, Cao Y (2021) Code generation from natural language with less prior knowledge and more monolingual data. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, vol 2: Short Papers. Association for Computational Linguistics, Bangkok, p 776–785. https://doi.org/10.18653/v1/2021.acl-short.98
Orlanski G, Gittens A (2021) Reading StackOverflow encourages cheating: adding question text improves extractive code generation. arXiv preprint arXiv:2106.04447
Beau N, Crabbé B (2022) The impact of lexical and grammatical processing on generating code from natural language. arXiv preprint arXiv:2202.13972
Wang Z, Cuenca G, Zhou S, Xu FF, Neubig G (2022) MCoNaLa: a benchmark for code generation from multiple natural languages. arXiv preprint arXiv:2203.08388
Kusupati U, Ailavarapu VRT (2022) Natural language to code using transformers. arXiv preprint arXiv:2202.00367
Al-Hossami E, Shaikh S (2022) A survey on artificial intelligence for source code: a dialogue systems perspective. arXiv preprint arXiv:2202.04847
Cho K, Van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, Bengio Y (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078
Lewis M, Liu Y, Goyal N, Ghazvininejad M, Mohamed A, Levy O, Stoyanov V, Zettlemoyer L (2019) BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461
Subramanyam Kalyan K, Rajasekharan A, Sangeetha S (2021) AMMUS: a survey of transformer-based pretrained models in natural language processing. arXiv preprint arXiv:2108.05542
Kudo T, Richardson J (2018) Sentencepiece: a simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226
Kudo T (2018) Subword regularization: improving neural network translation models with multiple subword candidates. arXiv preprint arXiv:1804.10959
Sennrich R, Haddow B, Birch A (2015) Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. In: Advances in neural information processing systems 30 (NIPS 2017), Annual Conference on Neural Information Processing Systems 2017, Long Beach, CA, USA. Curran Associates, Inc., p 5998–6008. https://papers.nips.cc/paper/7181-attention-is-all-you-need
Alammar J (2018) The illustrated transformer. http://jalammar.github.io/illustrated-transformer/. Accessed May 2021
Ling W, Blunsom P, Grefenstette E, Hermann KM, Kočiský T, Wang F, Senior A (2016) Latent predictor networks for code generation. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Berlin, pp 599–609. https://doi.org/10.18653/v1/P16-1057