Vietnamese treebank construction and entropy-based error detection

Springer Science and Business Media LLC - Volume 49 - Pages 487-519 - 2015

Phuong-Thai Nguyen¹, Anh-Cuong Le¹, Tu-Bao Ho², Van-Hiep Nguyen³

¹University of Engineering and Technology, Vietnam National University, Hanoi

²Japan Advanced Institute of Science and Technology, Nomi, Japan

³Institute of Linguistics, Vietnam Academy of Social Sciences, Hanoi, Vietnam

Abstract

Treebanks, most notably the Penn Treebank for English, play an essential role in both natural language processing (NLP) research and its applications. However, many languages still lack treebanks, and building one can be complicated and difficult. This work has a twofold objective. The first is to share our results in constructing a large Vietnamese treebank (VTB) with three levels of annotation: word segmentation, part-of-speech tagging, and syntactic analysis. We describe the major steps in the construction process, with particular regard to properties specific to Vietnamese, such as the lack of word delimiters and its isolating nature. These properties make sentences highly syntactically ambiguous, which makes it difficult to ensure a high level of agreement among annotators. Various studies of Vietnamese syntax were used not only to define the annotations but also to deal with ambiguities systematically. Annotators were supported by automatic labelling tools, based on statistical machine learning methods, for sentence pre-processing, and by a tree editor for manual annotation. As a result, an annotation agreement of around 90% was achieved. Our second objective is to present a method for automatically finding errors and inconsistencies in treebank corpora, and its application to the construction of the VTB. The method uses Shannon entropy on the premise that the more the entropy of a treebank is reduced, the more errors have been corrected. It ranks error candidates with a scoring function based on conditional entropy. Our experiments showed that the method detected high-error-density subsets of the original error candidate sets, and that the corpus entropy was significantly reduced after error correction. These subsets were only about one third the size of the whole candidate set, yet contained 80–90% of the total errors. The method can also be applied to languages similar to Vietnamese.
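The abstract does not give the scoring function itself, so the following is only a minimal sketch of the general idea of ranking error candidates by conditional entropy. It assumes that annotations are reduced to (context, label) pairs, that a candidate is any token whose label differs from the majority label of its context, and that the score is the drop in H(label | context) when that one token is relabelled. The function names, the toy corpus, and the majority-label heuristic are all illustrative assumptions, not the authors' method.

```python
import math
from collections import Counter


def conditional_entropy(pairs):
    """Return H(label | context) in bits for a list of (context, label) pairs."""
    total = len(pairs)
    joint = Counter(pairs)
    context_counts = Counter(context for context, _ in pairs)
    entropy = 0.0
    for (context, _label), count in joint.items():
        # Accumulate -p(context, label) * log2 p(label | context).
        entropy -= (count / total) * math.log2(count / context_counts[context])
    return entropy


def rank_candidates(pairs):
    """Rank minority-labelled tokens by the entropy drop from relabelling them.

    Assumption: the 'correction' tried for each candidate is the majority
    label of its context. Returns (entropy_reduction, index) tuples,
    largest reduction first.
    """
    base = conditional_entropy(pairs)
    labels_by_context = {}
    for context, label in pairs:
        labels_by_context.setdefault(context, Counter())[label] += 1
    majority = {c: counts.most_common(1)[0][0] for c, counts in labels_by_context.items()}

    scored = []
    for i, (context, label) in enumerate(pairs):
        if label == majority[context]:
            continue  # consistent with its context, so not a candidate
        changed = pairs[:i] + [(context, majority[context])] + pairs[i + 1:]
        scored.append((base - conditional_entropy(changed), i))
    return sorted(scored, reverse=True)


# Hypothetical toy corpus: context "a" is labelled "X" nine times and "Y" once.
corpus = [("a", "X")] * 9 + [("a", "Y")] + [("b", "Y")] * 5
ranked = rank_candidates(corpus)
```

On this toy corpus the only candidate is the single ("a", "Y") token at index 9, and relabelling it brings the conditional entropy down to zero. On a real treebank the context would be something richer than a single symbol, and each candidate would still need to be checked by an annotator.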

