Challenges and Solutions for Consistent Annotation of Vietnamese Treebank

Quy Nguyen, Yusuke Miyao, Ha Le, Ngan Nguyen


Abstract
Treebanks are important resources for researchers in natural language processing, speech recognition, theoretical linguistics, etc. To strengthen the automatic processing of the Vietnamese language, a Vietnamese treebank has been built. However, the quality of this treebank is not satisfactory and is a possible cause of the low performance of Vietnamese language processing. We have been building a new treebank for Vietnamese with about 40,000 sentences annotated with three layers: word segmentation, part-of-speech tagging, and bracketing. In this paper, we describe several challenges posed by the Vietnamese language and how we address them in developing annotation guidelines. We also present our methods for improving the quality of the annotation guidelines and ensuring annotation accuracy and consistency. Experimental results show that inter-annotator agreement ratios and accuracy are higher than 90%, which is satisfactory.
Anthology ID:
L16-1243
Volume:
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Month:
May
Year:
2016
Address:
Portorož, Slovenia
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
1532–1539
URL:
https://aclanthology.org/L16-1243
Cite (ACL):
Quy Nguyen, Yusuke Miyao, Ha Le, and Ngan Nguyen. 2016. Challenges and Solutions for Consistent Annotation of Vietnamese Treebank. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 1532–1539, Portorož, Slovenia. European Language Resources Association (ELRA).
Cite (Informal):
Challenges and Solutions for Consistent Annotation of Vietnamese Treebank (Nguyen et al., LREC 2016)
PDF:
https://preview.aclanthology.org/emnlp22-frontmatter/L16-1243.pdf