Complex Word Identification in Vietnamese: Towards Vietnamese Text Simplification

Phuong Nguyen, David Kauchak


Abstract
Text Simplification has been extensively researched in English, but has not been investigated in Vietnamese. We focus on the Vietnamese-specific Complex Word Identification task, often the first step in Lexical Simplification (Shardlow, 2013). We examine three different Vietnamese datasets constructed for other Natural Language Processing tasks and show that, as in other languages, frequency is a strong signal in determining whether a word is complex, with a mean accuracy of 86.87%. Across the datasets, we find that the 10% most frequent words in many corpora can be labelled as simple, and the rest as complex, though this is more variable for smaller corpora. We also examine how human annotators perform at this task. Given the subjective nature of the task, there is a fair amount of variability in which words are seen as difficult, though majority results are more consistent.
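
The frequency threshold described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `frequency_cwi`, the `simple_fraction` parameter, and the toy token list are hypothetical, and the sketch assumes the corpus has already been segmented into words (Vietnamese words can span several space-separated syllables, so whitespace splitting alone is not sufficient).

```python
from collections import Counter


def frequency_cwi(tokens, simple_fraction=0.10):
    """Label word types as 'simple' or 'complex' by corpus frequency rank.

    Hypothetical helper: the top `simple_fraction` of word types, ranked by
    frequency, are labelled simple; every other type is labelled complex.
    `tokens` is assumed to be a list of already word-segmented tokens.
    """
    counts = Counter(tokens)
    # Rank types from most to least frequent; break ties alphabetically
    # so the result is deterministic.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    # Keep at least one type in the simple set, even for tiny corpora.
    cutoff = max(1, int(len(ranked) * simple_fraction))
    simple = set(ranked[:cutoff])
    return {w: ("simple" if w in simple else "complex") for w in ranked}


# Toy example with ten word types (placeholder tokens, not real data).
toy_tokens = ["a"] * 10 + ["b"] * 5 + ["c"] * 3 + list("defghij")
labels = frequency_cwi(toy_tokens, simple_fraction=0.10)
```

With ten word types and a 10% cutoff, only the single most frequent type lands in the simple set. On a small corpus the cutoff rank moves with very few types, which is one plausible reading of why the abstract reports more variable results for smaller corpora.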
Anthology ID:
2022.mia-1.6
Volume:
Proceedings of the Workshop on Multilingual Information Access (MIA)
Month:
July
Year:
2022
Address:
Seattle, USA
Editors:
Akari Asai, Eunsol Choi, Jonathan H. Clark, Junjie Hu, Chia-Hsuan Lee, Jungo Kasai, Shayne Longpre, Ikuya Yamada, Rui Zhang
Venue:
MIA
Publisher:
Association for Computational Linguistics
Pages:
59–68
URL:
https://aclanthology.org/2022.mia-1.6
DOI:
10.18653/v1/2022.mia-1.6
Cite (ACL):
Phuong Nguyen and David Kauchak. 2022. Complex Word Identification in Vietnamese: Towards Vietnamese Text Simplification. In Proceedings of the Workshop on Multilingual Information Access (MIA), pages 59–68, Seattle, USA. Association for Computational Linguistics.
Cite (Informal):
Complex Word Identification in Vietnamese: Towards Vietnamese Text Simplification (Nguyen & Kauchak, MIA 2022)
PDF:
https://preview.aclanthology.org/nschneid-patch-4/2022.mia-1.6.pdf