Abstract
In this paper, we present the first empirical study of Vietnamese disfluency detection. To conduct this study, we first create a disfluency detection dataset for Vietnamese, with manual annotations over two disfluency types. We then perform experiments with strong baseline models and find that: (i) automatic Vietnamese word segmentation improves the disfluency detection performance of the baselines, and (ii) the highest performance is obtained by fine-tuning pre-trained language models, in which the monolingual model PhoBERT for Vietnamese does better than the multilingual model XLM-R.
- Anthology ID: 2022.wnut-1.21
- Volume: Proceedings of the Eighth Workshop on Noisy User-generated Text (W-NUT 2022)
- Month: October
- Year: 2022
- Address: Gyeongju, Republic of Korea
- Venue: WNUT
- Publisher: Association for Computational Linguistics
- Pages: 194–200
- URL: https://aclanthology.org/2022.wnut-1.21
- Cite (ACL): Mai Hoang Dao, Thinh Hung Truong, and Dat Quoc Nguyen. 2022. Disfluency Detection for Vietnamese. In Proceedings of the Eighth Workshop on Noisy User-generated Text (W-NUT 2022), pages 194–200, Gyeongju, Republic of Korea. Association for Computational Linguistics.
- Cite (Informal): Disfluency Detection for Vietnamese (Dao et al., WNUT 2022)
- PDF: https://preview.aclanthology.org/proper-vol2-ingestion/2022.wnut-1.21.pdf
- Code: vinairesearch/phodisfluency